Introduction
Using Open Office, it is possible to convert many kinds of documents to PDF in batch mode, including Microsoft Office Word, Excel and PowerPoint files, plain text files, Open Office documents, JPEG and GIF images, and many others. Using UNO (Universal Network Objects), the Open Office component model, a conversion process can connect to a running backroom instance of Open Office, load documents, and save them as PDF. This article and the accompanying Java program show how to do this for multiple inputs and then optionally merge the results into a single PDF file using GPL Ghostscript.
Using the Code
The program is a command line Java program. It was developed and tested on Windows XP using the NetBeans IDE, and it was built and tested on Fedora Core 7 and Ubuntu 8.04 using just command line javac and CLASSPATH. The Sun JDK 1.6 was used in all cases. The program comes with a configuration file that must be bundled with the pdfcm package. (See the Requirements and Compilation Details section for details on compilation and requirements.)
Usage: java pdfcm.Main [-m mergeFile] [-d] file1 [file2 [file3...]]
Usage (jarfile): java -jar pdfcm.jar [-m mergeFile] [-d] file1 [file2 [file3...]]
Converts all given input files to PDF. Output filenames have the same base
filenames as the input files and the extension "pdf". PDF files on the
input are not processed.
INPUT OPTIONS
-m mergeFile
Causes converted PDF files and existing PDF (unprocessed) files on the
input to be merged into a single PDF file given by mergeFile as a final
step.
-d
Causes input files to be removed after successful processing. When used
in conjunction with the -m option, all intermediate files as well as any
PDF files on the input are removed after successful processing.
In case of a name collision between an input filename and the merge filename,
in the case that the -d option is given, the collision will be resolved. If
the -d option is not given, then an error will be generated to prevent accidental
overwrite of a file.
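The option handling described in the usage text could be parsed along the following lines. This is only a sketch: the CliOptions class and its field names are hypothetical and are not taken from the Main source, which remains the reference for the program's actual behavior.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder for the options in the usage text; not the real Main class. */
final class CliOptions {
    String mergeFile;        // value of -m, or null when no merge was requested
    boolean deleteInputs;    // true when -d was given
    final List<String> inputs = new ArrayList<String>();

    /** Parses [-m mergeFile] [-d] file1 [file2 ...]; throws on malformed input. */
    static CliOptions parse(String[] args) {
        CliOptions opts = new CliOptions();
        for (int i = 0; i < args.length; i++) {
            if ("-m".equals(args[i])) {
                if (i + 1 >= args.length) {
                    throw new IllegalArgumentException("-m requires a merge file name");
                }
                opts.mergeFile = args[++i];   // consume the option's argument
            } else if ("-d".equals(args[i])) {
                opts.deleteInputs = true;
            } else {
                opts.inputs.add(args[i]);     // everything else is an input file
            }
        }
        if (opts.inputs.isEmpty()) {
            throw new IllegalArgumentException("at least one input file is required");
        }
        return opts;
    }
}
```

For example, CliOptions.parse(new String[] {"-m", "all.pdf", "-d", "a.doc"}) would yield a merge file of all.pdf, the delete flag set, and a single input.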
Points of Interest
The class that does the PDF conversion is the PDFConvert class. In a nutshell, it loops over input files, opens each file in batch mode using Open Office, and exports each file to PDF using the appropriate filter. Afterwards, if the option is specified, it collects the resulting PDF files and merges them into one big happy PDF file using GPL Ghostscript. File types are identified by file name extensions only, and are mapped to Open Office filters in the configuration file.
PDF Conversion using Open Office
For the benefit of those readers cutting and pasting as they read, the following imports are needed.
import java.io.File;
import java.io.InputStream;
import com.sun.star.beans.PropertyValue;
import com.sun.star.uno.XComponentContext;
import com.sun.star.uno.UnoRuntime;
import com.sun.star.lang.XMultiComponentFactory;
import com.sun.star.frame.XComponentLoader;
import com.sun.star.frame.XStorable;
import com.sun.star.io.IOException;
import com.sun.star.util.XCloseable;
import com.sun.star.lang.XComponent;
import ooo.connector.BootstrapSocketConnector;
The conversion method has the signature public boolean DoConvert(String[] inputFiles), where the array inputFiles contains the list of input files given on the command line, adjusted to include their full absolute paths. The first step is to establish a connection to Open Office. Before actually running the program, make sure that Open Office is installed correctly on the system, and verify that a document can be created using Open Office by you or the intended user. Make sure to click through any "first time use" registration screens before using the program.
The interfaces and objects used to talk to Open Office are:
XComponentContext xContext = null;
XMultiComponentFactory xMCF = null;
XComponentLoader xComponentLoader = null;
XStorable xStorable = null;
XCloseable xCloseable = null;
Object desktop = null;
Object document = null;
Details about each of the above components can be found in the Open Office SDK Documentation found at documentation.openoffice.org. For our purposes, it suffices to say that these objects encapsulate the API we need to talk to Open Office. In the next code snippet, the BootstrapSocketConnector is used to connect to Open Office. Open Office will be running in another process and listening on a socket, and the BootstrapSocketConnector takes care of the connection details. The BootstrapSocketConnector class is in the bootstrapconnector.jar file included with the source code with kind permission of Hol.sten, a regular contributor in the Open Office forums. There are many excellent resources on connection alternatives in the Open Office forums. However, nothing more than the bootstrap connector should be needed unless more than one instance of Open Office is going to be running, an Open Office instance on another server is being used, or there are other fancy requirements.
To follow the next code snippets, keep in mind that conversion status is kept in two properties on the PDFConvert object: StatusText and IsError. To keep track of non-fatal errors and status during loop processing, messages are further accumulated in a StringBuffer buf and stuffed into the StatusText at the end.
try {
    // ooLibPath comes from the configuration file.
    String ooLibFolder = ooLibPath;
    xContext = BootstrapSocketConnector.bootstrap(ooLibFolder);
    xMCF = xContext.getServiceManager();
    desktop = xMCF.createInstanceWithContext(
            "com.sun.star.frame.Desktop", xContext);
    xComponentLoader = (XComponentLoader) UnoRuntime.queryInterface(
            XComponentLoader.class, desktop);
} catch (Exception ex) {
    statusText = "Could not get usable OpenOffice: " + ex.toString();
    isError = true;
    return false;
}
An exception thrown here is almost certainly due to misinstallation or misconfiguration of Open Office or the connection to Open Office. There's really nothing for the program to do here but report back that it couldn't connect to Open Office, and then Open Office can be reconfigured/reinstalled/rebooted, and you can try again. In order to debug Open Office connection problems, again make sure that the intended user can run Open Office offline and open the intended files without errors or popup registration windows.
The next bits of code are in the loop processing input files and also appear in try {...} catch {...} finally {...} blocks. In the following snippet, the file extension is in the String variable ext, and the input file names are in the array inputFiles indexed by i. When opening the file, we request that Open Office not open a document window since we are running in batch mode.
// "Hidden" asks Open Office to load the document without showing a window.
PropertyValue[] loaderValues = new PropertyValue[1];
loaderValues[0] = new PropertyValue();
loaderValues[0].Name = "Hidden";
loaderValues[0].Value = Boolean.TRUE;
Then we convert the current file name to the URI format required by Open Office.
String docURL = "file:///" + inputFiles[i]
.replace(File.separatorChar, '/')
.replace(" ", "%20");
lastDot = docURL.lastIndexOf('.');
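The string replacement above escapes only spaces, and the Discussion section notes that filename handling should be tightened up for other special characters. As one possible direction (a sketch, not what the original program does), java.io.File.toURI() percent-encodes the path, and the encoded path can be given the same file:/// prefix used above. The OfficeUrls class and the toOfficeUrl method are hypothetical names.

```java
import java.io.File;

/** Hypothetical helper; the original program builds the URL by string replacement. */
final class OfficeUrls {
    /** Converts a native path to a file:/// URL with reserved characters percent-encoded. */
    static String toOfficeUrl(String nativePath) {
        // toURI() produces "file:/..." with an encoded path. Keep the encoded path and
        // add the prefix back so the result has the three-slash form used in the article.
        String rawPath = new File(nativePath).getAbsoluteFile().toURI().getRawPath();
        return "file://" + rawPath;
    }

    public static void main(String[] args) {
        // Prints a URL in which the spaces and the '#' are percent-encoded.
        System.out.println(toOfficeUrl("my report #2.doc"));
    }
}
```

Whether Open Office accepts every encoded form produced this way has not been verified here, so this should be tested against the target installation before replacing the simpler code.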
Next, the document is opened and an interface is obtained that can store the file using a filter appropriate to the kind of file it is. The program will then save the results to a file of the same name, but with the PDF extension, and append the filename to a converted file list. If the file is one of the "native" types, for example if it is already a PDF file, it just gets added directly to the list.
if (StringArrayContains(nativeTypes, ext)) {
    // Already PDF (or PostScript): no conversion needed, just queue it for merging.
    convertedFiles.add(docURL);
} else {
    document = xComponentLoader.loadComponentFromURL(
            docURL, "_blank", 0, loaderValues);
    xStorable = (XStorable) UnoRuntime.queryInterface(
            XStorable.class, document);
    PropertyValue[] saveArgs = new PropertyValue[2];
    saveArgs[0] = new PropertyValue();
    saveArgs[0].Name = "Overwrite";
    saveArgs[0].Value = Boolean.TRUE;
    saveArgs[1] = new PropertyValue();
    saveArgs[1].Name = "FilterName";
    // Pick the PDF export filter that matches the document type.
    if (StringArrayContains(writerTypes, ext)) {
        saveArgs[1].Value = "writer_pdf_Export";
    } else if (StringArrayContains(calcTypes, ext)) {
        saveArgs[1].Value = "calc_pdf_Export";
    } else if (StringArrayContains(drawTypes, ext)) {
        saveArgs[1].Value = "draw_pdf_Export";
    } else {
        buf.append("File " + i + " has unknown extension: " + ext);
        isError = true;
        continue;
    }
    // Same name as the input, with the extension replaced by ".pdf".
    String sSaveUrl = docURL.substring(0, lastDot) + ".pdf";
    xStorable.storeToURL(sSaveUrl, saveArgs);
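The extension-to-filter decision in the snippet above can also be expressed as a single lookup. The sketch below assumes the extension lists have already been read from the configuration file. The PdfFilters class, its method names, and any sample extensions are hypothetical; the three filter names are the ones used in the code above.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical lookup table from file extension to Open Office PDF export filter. */
final class PdfFilters {
    private final Map<String, String> filterByExt = new HashMap<String, String>();

    PdfFilters(String[] writerTypes, String[] calcTypes, String[] drawTypes) {
        register("writer_pdf_Export", writerTypes);
        register("calc_pdf_Export", calcTypes);
        register("draw_pdf_Export", drawTypes);
    }

    private void register(String filterName, String[] extensions) {
        for (String ext : extensions) {
            filterByExt.put(ext.toLowerCase(Locale.ROOT), filterName);
        }
    }

    /** Returns the export filter for an extension such as ".doc", or null if unknown. */
    String filterFor(String ext) {
        return filterByExt.get(ext.toLowerCase(Locale.ROOT));
    }
}
```

One design consideration: doing this lookup before calling loadComponentFromURL would let a file with an unknown extension be rejected without opening it in Open Office at all.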
Various exceptions are handled, but it is always important to close a file when done, so the code to do that goes into a finally block. Following recommended practice, the XCloseable interface is used to close the document. However, the XCloseable interface may be null, or XCloseable.close() may throw an exception. In those cases, the XComponent interface is used to dispose of the object explicitly. The issue is that multiple clients may be using the same document. For example, if a document is being queued up for printing or a modal dialog is open somewhere, Open Office may throw a CloseVetoException on attempts to close the document. That situation does not arise in this use case, so the document is simply disposed.
finally {
    if (document != null) {
        xCloseable = (XCloseable) UnoRuntime.queryInterface(
                XCloseable.class, document);
        if (xCloseable != null) {
            try {
                xCloseable.close(false);
            } catch (com.sun.star.util.CloseVetoException ex) {
                // The close was vetoed; dispose of the document explicitly.
                XComponent xComp = (XComponent) UnoRuntime.queryInterface(
                        XComponent.class, document);
                xComp.dispose();
            }
        } else {
            // No XCloseable interface is available; fall back to dispose().
            XComponent xComp = (XComponent) UnoRuntime.queryInterface(
                    XComponent.class, document);
            xComp.dispose();
        }
    }
    document = null;
}
Merging Using Ghostscript
The converted files are kept in the list convertedFiles. To merge these, a command line is constructed to pass to the OS, so the file names have to be converted back to the native OS format. Note that it is OK to feed bare PostScript files to GPL Ghostscript as well as PDF files, so PostScript was included as a "native" file type. Because of the shenanigans surrounding how spaces in pathnames are handled between Windows and Unix, the code is not reproduced here.
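Since the command construction is not shown, here is one way it might look. This is a sketch under assumptions, not the program's code: it builds an argument list rather than a single command string, which is a different technique from the doubleQuoted and escapeSpaces styles described under Configuration. The GsCommand class and its parameter names are hypothetical. The Ghostscript switches are the ones listed in the requirements section, plus the standard -sOutputFile switch.

```java
import java.io.File;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical builder for the Ghostscript merge command; not the article's code. */
final class GsCommand {
    /** Builds an argument list that merges the given file:/// URLs into mergeFile. */
    static List<String> build(String gsExe, String mergeFile, List<String> fileUrls) {
        List<String> cmd = new ArrayList<String>();
        cmd.add(gsExe);
        cmd.add("-q");
        cmd.add("-dNOPAUSE");
        cmd.add("-dBATCH");
        cmd.add("-sDEVICE=pdfwrite");
        cmd.add("-sOutputFile=" + mergeFile);
        for (String url : fileUrls) {
            // new File(URI) decodes escapes such as %20 and applies the native separator.
            // URI.create throws IllegalArgumentException if the URL is not well formed.
            cmd.add(new File(URI.create(url)).getPath());
        }
        return cmd;
    }
}
```

The resulting list can be handed to new ProcessBuilder(cmd), which passes each element to the child process as one argument, so a path containing spaces stays in one piece.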
Then a process is opened with the constructed command line and the program waits for it to finish. Note that the output and error streams need to be handled in order for the child process to finish cleanly. Though there may be more elegant means to do it, for this use case those streams are just closed and only the exit code is retained.
try {
    Process mProc = Runtime.getRuntime().exec(cmd.toString());
    // The child's output is not needed, so both streams are closed right away.
    InputStream iStr = mProc.getInputStream();
    iStr.close();
    InputStream eStr = mProc.getErrorStream();
    eStr.close();
    int exCode = mProc.waitFor();
    if (exCode == 0) {
        buf.append("Merge succeeded: exit code was zero.");
    } else {
        isError = true;
        buf.append("Merge failed: exit code was " + exCode);
    }
} catch (java.io.IOException ex) {
    buf.append("Merge failed: " + ex.toString());
    isError = true;
    statusText = buf.toString();
    return false;
} catch (java.lang.InterruptedException ex) {
    buf.append("Merge interrupted: " + ex.toString());
    isError = true;
    statusText = buf.toString();
    return false;
}
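Closing the streams right away discards any diagnostic output from Ghostscript. A possible alternative, shown as a sketch rather than as the program's code, is to fold the error stream into standard output and read it to the end before waiting for the exit code. ProcessRunner and its method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

/** Hypothetical helper that runs a child process and keeps its output. */
final class ProcessRunner {
    /** Reads a stream to its end and returns the bytes as text in the platform charset. */
    static String drain(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toString();
    }

    /** Runs the command, appends its combined output to log, and returns the exit code. */
    static int run(List<String> cmd, StringBuffer log)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(cmd);
        pb.redirectErrorStream(true);      // stderr is merged into stdout: one reader suffices
        Process proc = pb.start();
        proc.getOutputStream().close();    // the child receives no input
        InputStream out = proc.getInputStream();
        try {
            log.append(drain(out));        // reading to EOF keeps the child from blocking
        } finally {
            out.close();
        }
        return proc.waitFor();
    }
}
```

With this approach the accumulated text could be appended to buf on failure, at the cost of holding the whole output in memory.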
Note that Ghostscript is a venerable and well maintained program. If the program fails at this step, then it is again almost certainly because GPL Ghostscript was misinstalled or misconfigured, or because an input file was bad. As with Open Office, this can be checked offline by explicitly running GPL Ghostscript in a command shell on the suspect input, so there is no need to reproduce the exact error messages.
Other Program Details
The -d feature causes deletion of input or intermediate files on completion. This is useful for backroom processing situations where you don't want clutter to accumulate or you don't want to worry about sensitive information being left around inadvertently. For this feature and other program features described in the usage text, please refer to the Main source code and the usage text.
Requirements and Compilation Details
Open Office
Open Office can be obtained from www.openoffice.org. This program has been tested with Open Office versions 2.3.0 on Windows XP, 2.4.0 on Fedora Core 7, and 2.4.0 on Ubuntu 8.04. Previous versions of Open Office have the ability to create PDF, and as long as some flavor of Open Office version 2.X is being used, the connection mechanism used in this code should be supported. However, this code has not been built or tested against previous Open Office versions. (See threads referenced in this thread at the Open Office forums site for more information.)
GPL Ghostscript
GPL Ghostscript can be obtained at pages.cs.wisc.edu/~ghost. This code has been tested with Ghostscript versions 8.61 on Windows, 8.62 on Fedora Core 7, and 8.62 on Ubuntu 8.04. Previous versions of GPL Ghostscript should work as long as they support the command line options -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite. Note: ESP Ghostscript version 8.15, which comes standard with some flavors of Linux, did not work in this capacity even though it supports the above options. See further discussion below.
bootstrapconnector.jar
The bootstrapconnector.jar is a simplified mechanism for connecting to an Open Office instance created by Hol.sten at the Open Office developers' forums. It is included here with the source code for convenience, but is ultimately available from this thread.
Setting up your Build Environment
The code was compiled and tested using Sun JDK 1.6 on Fedora Core 7, Ubuntu 8.04, and Windows XP SP2. It will neither compile nor work with versions of GCJ that are not Java 1.5 compliant. Make sure that the CLASSPATH or IDE build environment includes the Open Office jarfiles juh.jar, jurt.jar, jut.jar, ridl.jar, unoil.jar, and officebean.jar. Usually, these jarfiles can be found in the program/classes subdirectory of your Open Office installation. And of course, bootstrapconnector.jar must also be included on the CLASSPATH.
Configuration
There are several configuration constants that need to be set for your system in the config.properties file.
- ooLibPath is the location of your Open Office installation, and a folder named classes should be in that folder. (E.g. This is C:\Program Files\OpenOffice.org 2.3\program on a Windows system with a standard 2.3.0 install.)
- gsExePath is the directory of the GPL Ghostscript executable. (E.g. This is C:\Program Files\gs\gs8.61\bin on a Windows system with a standard install.)
- gsExeName is the name of the Ghostscript executable. (On Windows, it has to be the one of the two shipped executables that runs in a console window, so this is gswin32c.exe on a Windows system. On UNIX systems, it is usually just gs.)
- shellCommandStyle can choose between one of two hacks for building the GPL Ghostscript command line. doubleQuoted will cause it to be built to handle spaces in filenames and paths using double quotes appropriate for Microsoft Windows, and escapeSpaces will cause it to be built to handle spaces Unix style with backslashes.
- writerTypes is a list of file extensions, comma separated in lower case including the leading '.', that are expected to be handled by Open Office Writer.
- calcTypes is a list of file extensions, comma separated in lower case including the leading '.', that are expected to be handled by Open Office Calc.
- drawTypes is a list of file extensions, comma separated in lower case including the leading '.', that are expected to be handled by Open Office Draw. This category seems to include presentation types for the moment, so there is no separate category for Open Office Impress.
- nativeTypes is a list of file extensions, comma separated in lower case including the leading '.', that are either already PDF or can be handled directly by GPL Ghostscript, such as Postscript.
There are probably many more, but these are the popular ones.
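As an illustration of how these settings might be read, the sketch below loads them with java.util.Properties and splits an extension list into its entries. PdfcmConfig is a hypothetical class, not the loader used by the program, and any property values passed to it here are made-up examples.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Properties;

/** Hypothetical reader for config.properties style settings. */
final class PdfcmConfig {
    /** Parses properties text such as "gsExeName=gs". */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException ex) {
            // A StringReader should not fail, but load() declares the exception.
            throw new IllegalStateException("Could not read configuration: " + ex);
        }
        return props;
    }

    /** Splits a comma separated list such as ".doc,.txt" into trimmed, lower case entries. */
    static List<String> extensions(Properties props, String key) {
        List<String> result = new ArrayList<String>();
        for (String part : props.getProperty(key, "").split(",")) {
            String ext = part.trim().toLowerCase(Locale.ROOT);
            if (ext.length() > 0) {
                result.add(ext);
            }
        }
        return result;
    }
}
```

One thing to watch for when editing the file by hand: java.util.Properties treats the backslash as an escape character, so a Windows path needs doubled backslashes (or forward slashes) to survive loading.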
Discussion
As noted above, the merge process fails when ESP Ghostscript 8.15 is used. The symptom is that for several test files, ESP Ghostscript begins to peg the CPU at 100% usage and makes very slow progress creating output, only a few bytes per second on a 2.7 GHz dual core machine. The GPL Ghostscript codebase was merged with ESP Ghostscript 8.15 at GPL version 8.57, and so GPL 8.61 can be regarded as an updated product anyway.
Beware of the difference between Windows and Unix style line endings when processing plain text files. Although there are no problems within a single platform, text files created on Windows with the CRLF line endings will try to open as Calc files using Open Office on Unix. In interactive mode, this has the further joyful effect of popping a dialog box to inquire about import parameters, and in batch mode it causes the XStorable interface above to be null during file processing.
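One conceivable workaround, which is not part of the original program and has not been verified against Open Office here, is to detect and normalize CRLF line endings in plain text inputs before handing them over. The LineEndings class below is a hypothetical sketch of that preprocessing step only.

```java
/** Hypothetical preprocessing helper for plain text inputs. */
final class LineEndings {
    /** Returns true if the text contains at least one Windows style CRLF pair. */
    static boolean hasCrlf(String text) {
        return text.indexOf("\r\n") >= 0;
    }

    /** Returns the text with every CRLF pair replaced by a single LF. */
    static String toUnix(String text) {
        return text.replace("\r\n", "\n");
    }
}
```

Because the conversion writes its output next to the input, the normalized text would have to be written to a temporary copy rather than over the original if the -d option is not in effect.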
A few corners were cut in this example. Nonetheless, this program is being used in production for backroom processing. In particular, for general use:
- The handling of filenames should be tightened up to account for other special characters and whitespace.
- The list of handled file types should be expanded and tested.
- The error handling and reporting should be more informative.
However, the ROI on this code is at the point of diminishing returns. The program itself is just glue and the true usefulness lies in Open Office and Ghostscript.
History
- 1st June, 2008: Initial post