Introduction
In my travels as a developer, I came across the need to output a document to PDF. My first thought was to use a tool for PDF writing. The problem was that I didn’t want to spend all of that time developing and updating a PDF template that I could modify with code. I wanted the end user to be able to modify the template as needed. Now I could create some sort of complicated system to allow them to do so or I could give them an editor, but neither option appealed to me. Aren’t custom applications supposed to make life easier?
In the end, my solution seemed to be a simple one: use a Microsoft Word document as the template, write to it with C#, and programmatically save it to PDF using the built-in tool included with Word. Since Microsoft controls both Word and the .NET environment, I assumed life would be good. That isn’t quite the case. There are a few pitfalls that you have to be aware of even when using .NET 4.0 and Microsoft Word 2010. In the end, however, I have exactly what I was looking for: a simple solution for the end user and a powerful, extensible solution that I can use for multiple different applications on the back end.
Solution Overview
For those of you who like a quick list of what I plan to do in the code, here are the steps:
- Connect to Microsoft Word 2010 via the Microsoft.Office.Interop.Word .NET component.
- Loop through the document to find the user-provided key words to replace – replace each with its given counterpart.
- Save the document as a PDF.
Simple, right? The devil is definitely in these details.
Problems Encountered
Whenever you deal with a bridge between systems, you can expect to come across at least one issue. You would think that this wouldn’t be as much of an issue when talking about two Microsoft systems but unfortunately this is not the case.
The biggest issue ends up being which version of the Interop library you use. There are two provided: one is a .NET component and one is a COM component. The first thing I found out about these two is that while I believe both use a COM wrapper, the COM component seems to be buggier than the .NET component. However, both have issues. Since this system uses a COM wrapper, the Word process doesn’t always get the message that the system is done. In the worst case, you can actually get multiple instances of winword.exe running at once, even if you properly close out and destroy your variables after use. I came across a few different “solutions” for this issue.
Solution one stipulates that the system has closed the objects but the garbage collector has not run yet, thus the objects still exist in memory. The thinking, therefore, is that you should call the garbage collector manually. For some reason, since it doesn’t work the first time, the suggestion is actually to call it twice. Here is the suggested code:
GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();
GC.WaitForPendingFinalizers();
I tried this using one thread. It doesn’t work. Instead, my system locks up until the winword.exe processes are all released. The processes don’t release any faster than if I did nothing. The only change is that my application locks up. Worse yet, we manually called the garbage collector. This throws out all of the optimizations that the garbage collector has made. For more thoughts on the garbage collector and why you shouldn’t do this, Jim Lyon has written a pretty good article about it.
There are a few more “rumor” solutions floating out there. The funniest one, I thought, was by a guy who figured out how to get into the process list and kill off all of the winword.exe processes that were currently running. I’ll give him points for style but the person who is trying to write a Word document while running this program might just have something to say about this “outside the box” hack (and no, I’m not posting the code here). This approach might be our only recourse in the case of a hung process but we really want to avoid it if at all possible.
So basically, we are left with an issue. How do we kill winword.exe or, better yet, how do we stop it from hanging? After much hard work (OK, so maybe it was just a few random guesses), I’ve developed a list of “best practices” for dealing with the Interop for Microsoft Office tools (yes, this includes Excel and PowerPoint).
Best Practices
The first thing we need to note is that the system works. OK, maybe it doesn’t work the way we want it to but that is because we are control freaks. We want to optimize the system. Each byte of memory needs to be controlled from start to finish. Let it go. Let the system do what it needs to do without trying to control it. Rabidly calling the garbage collector, killing off processes like a digital psycho, or other methods of exerting your control will only make the system mad.
The next thing to do is control what is in your power to control (this is starting to sound like the Serenity Prayer). Before you attempt to open a document, make sure it exists first. If at all possible, try to be sure that the document is the right type (manually, although there is a way to do it programmatically). This includes both extension and Office version (in case you are using the Office 2007 component and you come across a 2010 file for example). Once you have confirmed all of these details, make sure your code is optimized so you aren’t driving the component nuts with all of your calls. Finally, don’t open or close the actual application more than necessary. If you think you are going to need to use the application multiple times throughout its lifespan, keep the application object open. Maybe make it a (gasp) global variable.
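As a rough illustration of the "check before you open" advice, here is a minimal sketch of a pre-flight check that runs before Word is ever started. The class name TemplateCheck, the method name IsUsableTemplate, and the list of accepted extensions are assumptions made for this example; they are not part of the AutoWord class shown later.

```csharp
using System;
using System.IO;

// Hypothetical helper: verifies what can be verified without starting Word.
public static class TemplateCheck
{
    // Assumption: these are the only template types the application accepts.
    private static readonly string[] AllowedExtensions = { ".docx", ".docm", ".doc" };

    public static bool IsUsableTemplate(string path)
    {
        // Reject missing paths up front so Word never sees them.
        if (string.IsNullOrEmpty(path) || !File.Exists(path))
            return false;

        // Compare the extension case-insensitively (".DOCX" is fine).
        string extension = Path.GetExtension(path);
        foreach (string allowed in AllowedExtensions)
        {
            if (string.Equals(extension, allowed, StringComparison.OrdinalIgnoreCase))
                return true;
        }
        return false;
    }
}
```

An extension check only catches the obvious mistakes. It does not prove the file contents are a valid Word document, so the try/catch around the actual Word calls is still needed.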
The final best practice I would stress is to know what is going on. That sounds obvious but sometimes things happen. Make sure you know when the component is being called initially and make sure you know when it is being closed. Check to be sure that the destructor statement for the component is properly set in the finally block of any try/catch. Walk through the code to be sure things are happening as you expect them to happen. I’ve seen a lot of people blame the easy target only to find out later that the problem was a simple coding mistake. Not that we have ever done something like that but other people might need to know that.
The Code
So, we know that we can use the Microsoft Word .NET component and we know how to use it safely. The question that has to be on your mind now is what cool things can we do with this new-found power? In this article, I want to show how we can use Microsoft Word and the Save As PDF function to create amazing Word templates without using bookmarks or other advanced Microsoft Word items. There are many more things that Word automation can be used for but this practical example will give you a taste for the power available to you and it will provide an answer to our original problem.
I decided that instead of giving you multiple snippets of code, I would instead document my code well and present it here. That way you can copy and paste both the code and the documentation into your own application. Here is the class that does all of the heavy lifting:
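using System;
using System.Collections.Generic;
using Word = Microsoft.Office.Interop.Word;

namespace AutoWord
{
    public static class Document
    {
        // Opens the Word template at strWordDoc, replaces every key found in
        // dReplacements with its value, and exports the result to strPDFDoc.
        // The template itself is never saved. Returns true on success.
        public static bool Process(string strWordDoc, string strPDFDoc,
            Dictionary<string, string> dReplacements)
        {
            object oMissing = System.Reflection.Missing.Value;
            object oFalse = false;
            object oTrue = true;
            bool bolOutput = true;
            Word._Application oWord = null;
            Word._Document oDoc = null;

            // Control what we can: make sure the template exists before starting Word.
            if (!System.IO.File.Exists(strWordDoc))
            {
                Console.WriteLine("The file does not exist on the path specified.");
                return false;
            }

            try
            {
                oWord = new Word.Application();

                // Visible so that any dialog Word raises can be seen while diagnosing.
                oWord.Visible = true;

                // Arguments: file name, ConfirmConversions = false, ReadOnly = true.
                oDoc = oWord.Documents.Open(strWordDoc, oFalse, oTrue);

                // StoryRanges covers the main text as well as headers, footers, and so on.
                foreach (Word.Range oRange in oDoc.StoryRanges)
                {
                    foreach (KeyValuePair<string, string> dEntry in dReplacements)
                    {
                        oRange.Find.Text = dEntry.Key;
                        oRange.Find.Replacement.Text = dEntry.Value;
                        oRange.Find.Wrap = Word.WdFindWrap.wdFindContinue;
                        oRange.Find.Execute(Replace: Word.WdReplace.wdReplaceAll);
                    }
                }

                // Use Word's built-in export to write the PDF.
                oDoc.ExportAsFixedFormat(strPDFDoc,
                    Word.WdExportFormat.wdExportFormatPDF);
                bolOutput = true;
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.ToString());
                bolOutput = false;
            }
            finally
            {
                // Shut down in the finally block so that an exception above
                // does not leave a winword.exe process behind.
                try
                {
                    if (oDoc != null) oDoc.Close(oFalse, oMissing, oMissing);
                    if (oWord != null) oWord.Quit(oFalse, oMissing, oMissing);
                }
                catch (Exception ex)
                {
                    Console.WriteLine(ex.ToString());
                    bolOutput = false;
                }
                oDoc = null;
                oWord = null;
            }
            return bolOutput;
        }
    }
}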
Example Code
This code is really simple to use. Basically there is only one method call that you need to make and you are done. For those of you who might not have used a dictionary object before or don’t understand how it would be used in this instance, I will include the creation and use of the dictionary object:
Dictionary<string,string> dKeywords = new Dictionary<string,string>();
dKeywords.Add("<<Title>>", "PDF Creation Tool");
dKeywords.Add("<<Name>>","Timothy Corey");
dKeywords.Add("<<Email>>", "me@timothycorey.com");
dKeywords.Add("<<Website>>", "www.timothycorey.com");
AutoWord.Document.Process(@"C:\Temp\MyDocument.docx",@"C:\Temp\Portfolio.pdf",dKeywords);
While I chose a specific naming convention for my tags, this system will find any string that you specify and replace it with the value member of the dictionary object. Notice the actual method call to AutoWord.Document.Process asks for the Word template, the PDF file you want it to be saved as, and the items to replace (the dictionary object). I used verbatim string literals (the @ symbol before the string) so that I did not need to escape my backslashes, since they would normally be interpreted as escape characters themselves. Thus, instead of putting "C:\\Temp\\MyDocument.docx" I was able to put @"C:\Temp\MyDocument.docx". Both mean the same thing.
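For anyone who wants to see that equivalence for themselves, here is a tiny sketch. The class name PathLiterals and the Show method are made up for this example only.

```csharp
using System;

// Hypothetical demo class: the same path written two ways.
public static class PathLiterals
{
    // Regular literal: every backslash has to be doubled.
    public const string Escaped = "C:\\Temp\\MyDocument.docx";

    // Verbatim literal: backslashes are taken exactly as typed.
    public const string Verbatim = @"C:\Temp\MyDocument.docx";

    public static void Show()
    {
        // Both literals produce the identical string at run time.
        Console.WriteLine(Escaped == Verbatim); // prints True
    }
}
```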
Uses
One of the things I always have to ask myself when I see something cool is “why do I need it”. Now in the case of free code, the answer can be “just because” but I think there are some really great reasons to use this code. The first area that I see where this can be really powerful is in the area of account creation, account maintenance, or other user-specific operations. You can create pre-formatted templates that you then fill in with the user’s specific information. From there, you could email it to the user or put the file in their shared drive.
Another great way to use this is with record storage. You could have a system that automatically fills out a usage report (or some other type of report) and stores it for you. This way, you could have an entirely automated system that does the job and reports on itself as well.
Quick Notes
This solution was designed using .NET 4.0 and Microsoft Word 2010. However, this can be used without modification in .NET 3.5 and Microsoft Word 2007. I believe most of the same functionality resides in Microsoft Word 2003 but I haven’t tested it all to see how close it is. Also note that while this article was about how to create PDFs using a Microsoft Word document as a template, you could use some of the same techniques with Microsoft Excel and Microsoft PowerPoint. The power available to you is incredible.
FAQs
- Why didn’t you use the Microsoft Word bookmarks?
Good question. I’m glad you asked. Basically, we could do something very similar here using bookmarks. In C#, we could fill each bookmark with a value and be good to go. I decided not to use them because I had the option not to and I thought it would be simpler to skip them. When you are doing merging in Word, you don’t have the option not to use bookmarks. However, they can be a bit complicated for the end user. What is worse (in my opinion), you can’t use the same bookmark name in two places. The problem with this is when you want to put the same information in two places on one document. You end up with bookmarks named things like “name1”, “name2”, etc. Since I already had the power to manipulate the document, I decided to skip this system and roll my own. My system just finds a string match and replaces it. That means I can put a tag like “<<Name>>” in five places and it will replace all five with the same value. That tag format is up to me as well. I could decide to make it “**name**” instead. The other thing this does is open up our class for other uses as well. Say, for example, that your company wants to publish a PDF of a sensitive document. They want to redact every instance of their name and replace it with asterisks instead. No problem. Just feed the company name in as the tag name and make the value asterisks. The system will process the entire document within seconds and output it for you without changing the original document.
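The redaction idea could be sketched along these lines. The Redaction class, the BuildReplacements method, the company name, and the file paths in the comments are all hypothetical. The only real dependency is the AutoWord.Document.Process method from the class above, which receives the resulting dictionary unchanged.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: turns a list of sensitive terms into a
// replacement dictionary suitable for AutoWord.Document.Process.
public static class Redaction
{
    public static Dictionary<string, string> BuildReplacements(IEnumerable<string> terms)
    {
        var replacements = new Dictionary<string, string>();
        foreach (string term in terms)
        {
            // Skip empty entries; an empty search string is meaningless.
            if (string.IsNullOrEmpty(term)) continue;

            // One asterisk per character, so the layout stays roughly the same.
            replacements[term] = new string('*', term.Length);
        }
        return replacements;
    }
}

// Possible usage (paths and company name are placeholders):
//   var redactions = Redaction.BuildReplacements(new[] { "Contoso Ltd" });
//   AutoWord.Document.Process(@"C:\Temp\Report.docx", @"C:\Temp\Redacted.pdf", redactions);
```

Matching the asterisk count to the term length is just one choice. A fixed-length mask would avoid hinting at how long the redacted text was.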
- What happens if the winword.exe process doesn’t close?
The first thing I would suggest is that you give it a few minutes. Sometimes it takes a few minutes to close (not cool, I know, but it isn’t something I can help at this time). If the process(es) still don’t close out by themselves, check your code for errors. Make sure the document is the right version, that it has the proper tags, and that you have permission to save the PDF file to the given location. An error in the winword.exe process won’t be passed back (we make a blind call to Word), which means an error will hang the process. As part of this diagnostic process, run the Word application in visible mode (the oWord.Visible property in the code controls this). That might show you an error you missed. If you really can’t figure out why it is hanging, and you are sure it isn’t an error, you may have to work on a way to kill the process. I hate saying that because it is giving up. I don’t think you should even consider this until you have exhausted all other options. However, there may be an instance where you need to do this. It really isn’t hard. Basically, you need to loop through the processes and kill any with the name winword.exe. You can get more exact by capturing which process specifically your code creates and then only killing that process. This is what I would recommend.
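One possible way to capture "the process your code created" is to compare the list of Word process IDs before and after the application object is created. The sketch below is a last-resort illustration only. The class and method names are assumptions, and the approach has a known gap: if the user happens to start Word between the two snapshots, that process will look like yours as well.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical helper for tracking the Word process started by our own code.
public static class WordProcessTracker
{
    // Returns the IDs of all running Word processes.
    // GetProcessesByName expects the name without the ".exe" extension.
    public static HashSet<int> SnapshotWordProcessIds()
    {
        var ids = new HashSet<int>();
        foreach (Process process in Process.GetProcessesByName("WINWORD"))
        {
            ids.Add(process.Id);
            process.Dispose();
        }
        return ids;
    }

    // Returns the IDs present in "after" that were not present in "before".
    public static List<int> FindNewIds(ICollection<int> before, IEnumerable<int> after)
    {
        var created = new List<int>();
        foreach (int id in after)
        {
            if (!before.Contains(id)) created.Add(id);
        }
        return created;
    }

    // Kills one specific process, ignoring the case where it has already gone.
    public static void KillIfStillRunning(int processId)
    {
        try
        {
            using (Process process = Process.GetProcessById(processId))
            {
                if (!process.HasExited) process.Kill();
            }
        }
        catch (ArgumentException) { /* no process with that ID is running */ }
        catch (InvalidOperationException) { /* it exited before Kill was called */ }
    }
}

// Possible usage around the creation of the Word application object:
//   HashSet<int> before = WordProcessTracker.SnapshotWordProcessIds();
//   oWord = new Word.Application();
//   List<int> mine = WordProcessTracker.FindNewIds(before, WordProcessTracker.SnapshotWordProcessIds());
//   ... later, only if the process is truly hung:
//   foreach (int id in mine) WordProcessTracker.KillIfStillRunning(id);
```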
- Why should I use this over (insert tool name here)?
My philosophy is to use the tools I have first before I look for new ones to do a task. If you have Word already, why not use it? Also, I really don’t like making things complicated. If I have to learn something new to do something as simple as create a PDF, I think there is a problem. Most applications come with template editors or some method of creating the report template. They brag about how they are easy to use. Great, but my users already know how to use Microsoft Word so I would argue that this method has the best template editor of them all.
- What if I don’t have Microsoft Word on the server?
First of all, this solution is best for a desktop application rather than a server-based solution. However, if you really want to use this somewhere that Microsoft Word is not installed, then you are going to need to look into using the System.IO.Packaging namespace to manipulate the document without using Microsoft Word itself. Unfortunately, this will leave you with a Microsoft Word document and not a PDF (whoops). To get it to PDF will then require a tool to convert the document. So we end up back where we started. If you simply cannot install Microsoft Word on the server (and I understand if you cannot), I would suggest creating a web service on another server that has Word. This would allow you to safely create PDF documents in a server-based environment.
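To give a feel for the System.IO.Packaging route, here is a minimal sketch that fills tags in a .docx without starting Word. It rests on several assumptions, so treat it as a starting point rather than a finished tool: the class name PackageTemplate and the method Fill are invented for this example, the main text is assumed to live in the conventional /word/document.xml part, headers and footers (which live in other parts) are ignored, and each tag is assumed to sit unbroken in the XML. Word frequently splits typed text across several runs, in which case a plain string replacement like this will not find the tag. On the .NET Framework, the namespace requires a reference to WindowsBase.dll.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Packaging;
using System.Text;

// Hypothetical helper: tag replacement directly inside the .docx package.
public static class PackageTemplate
{
    // Assumption: the main document part is at its conventional location.
    private static readonly Uri MainPart = new Uri("/word/document.xml", UriKind.Relative);

    public static void Fill(string templatePath, string outputPath,
        IDictionary<string, string> replacements)
    {
        // Work on a copy so the template itself is never modified.
        File.Copy(templatePath, outputPath, true);

        using (Package package = Package.Open(outputPath, FileMode.Open, FileAccess.ReadWrite))
        {
            PackagePart part = package.GetPart(MainPart);

            string xml;
            using (var reader = new StreamReader(part.GetStream(FileMode.Open, FileAccess.Read), Encoding.UTF8))
            {
                xml = reader.ReadToEnd();
            }

            // Tags such as <<Name>> appear XML-escaped inside the part,
            // so both the key and the value are escaped before replacing.
            foreach (KeyValuePair<string, string> entry in replacements)
            {
                xml = xml.Replace(XmlEscape(entry.Key), XmlEscape(entry.Value));
            }

            // FileMode.Create truncates the part before the new XML is written.
            using (Stream stream = part.GetStream(FileMode.Create, FileAccess.Write))
            using (var writer = new StreamWriter(stream, new UTF8Encoding(false)))
            {
                writer.Write(xml);
            }
        }
    }

    private static string XmlEscape(string text)
    {
        // The ampersand must be handled first so it is not escaped twice.
        return text.Replace("&", "&amp;").Replace("<", "&lt;").Replace(">", "&gt;");
    }
}
```

As the answer above says, this still leaves you with a Word document rather than a PDF, so a separate conversion step is needed afterwards.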
- Aren’t there a lot of articles on Word automation in C#? Why create a new one?
I thought about this before writing this article. There are a lot of people out there who have done some great work in Word automation. However, I kept coming across partial solutions. It is great that you can manipulate Microsoft Word documents from C# but why? I didn’t want to do a theoretical exercise. I wanted a working, valuable solution. I think that the creation of PDF documents from a template on the fly meets those criteria. I also wanted to share my experiences with the pitfalls of Word automation since there seems to be a lot of confusion out there on how best to handle these issues.
Conclusion
So, in this article, we have discussed how to create a PDF in C# without spending any extra money and without using any special report designers. I hope you have enjoyed this code as much as I enjoyed writing it. I have attached a fully-working solution that will allow you to test out this functionality. Let me know what you think below.
History
- January 1, 2011 – Initial version