Batch Converting and Merging Multiple Documents to PDF Using Open Office and Ghostscript

1 Jun 2008, CPOL, 11 min read
An example of how to convert supported formats to PDF using Open Office and merge the results using Ghostscript

Introduction

Using Open Office, it is possible to convert many kinds of documents to PDF in batch mode, including Microsoft Office Word, Excel and PowerPoint, plain text files, Open Office documents, JPEG, GIF, and many others. Using UNO (Universal Network Objects), the Open Office component model, a conversion process can connect to a running backroom instance of Open Office, load documents and save them as PDF. This article and the accompanying Java program show how to accomplish this for multiple inputs and then optionally merge the results into a single PDF file using GPL Ghostscript.

Using the Code

The program is a command line Java program. It was developed and tested on Windows XP using the NetBeans IDE, and it was built and tested on Fedora Core 7 and Ubuntu 8.04 using just command line javac and CLASSPATH. The Sun JDK 1.6 was used in all cases. The program comes with a configuration file that must be bundled with the pdfcm package. (See the Requirements and Compilation Details section for details on compilation and requirements.)

Usage: java pdfcm.Main [-m mergeFile] [-d] file1 [file2 [file3...]]
Usage (jarfile): java -jar pdfcm.jar [-m mergeFile] [-d] file1 [file2 [file3...]]

    Converts all given input files to PDF.  Output filenames have the same base
    filenames as the input files and the extension "pdf".  PDF files on the
    input are not processed.

    INPUT OPTIONS

    -m mergeFile
        Causes converted PDF files and existing PDF (unprocessed) files on the
        input to be merged into a single PDF file given by mergeFile as a final
        step.

    -d
        Causes input files to be removed after successful processing.  When used
        in conjunction with the -m option, all intermediate files as well as any
        PDF files on the input are removed after successful processing.

    In case of a name collision between an input filename and the merge filename,
    in the case that the -d option is given, the collision will be resolved.  If
    the -d option is not given, then an error will be generated to prevent accidental
    overwrite of a file.
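
The option handling described in the usage text can be sketched roughly as follows. This is a minimal illustration, not the article's Main class: the class name CliOptions and its fields are hypothetical, options are accepted in any position, and the collision check compares raw strings where the real program works with absolute paths. How a collision is "resolved" when -d is given is not shown, since the usage text does not spell it out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder for parsed options; not the article's Main class. */
final class CliOptions {
    String mergeFile;                       // value of -m, or null if absent
    boolean deleteInputs;                   // true if -d was given
    final List<String> inputs = new ArrayList<String>();

    static CliOptions parse(String[] args) {
        CliOptions opts = new CliOptions();
        for (int i = 0; i < args.length; i++) {
            if (args[i].equals("-m")) {
                if (i + 1 >= args.length) {
                    throw new IllegalArgumentException("-m requires a file name");
                }
                opts.mergeFile = args[++i];   // consume the option's argument
            } else if (args[i].equals("-d")) {
                opts.deleteInputs = true;
            } else {
                opts.inputs.add(args[i]);     // everything else is an input file
            }
        }
        if (opts.inputs.isEmpty()) {
            throw new IllegalArgumentException("at least one input file is required");
        }
        // Without -d, refuse to overwrite an input file with the merged output
        if (opts.mergeFile != null && !opts.deleteInputs
                && opts.inputs.contains(opts.mergeFile)) {
            throw new IllegalArgumentException(
                "merge file collides with an input file: " + opts.mergeFile);
        }
        return opts;
    }
}
```

For example, CliOptions.parse(new String[] {"-m", "all.pdf", "a.doc", "b.xls"}) yields a merge file of all.pdf and two inputs, while passing the same name as both merge file and input without -d raises an error.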

Points of Interest

The class that does the PDF conversion is the PDFConvert class. In a nutshell, it loops over the input files, opens each file in batch mode using Open Office, and exports each file to PDF using the appropriate filter. Afterwards, if the -m option is specified, it collects the resulting PDF files and merges them into one big happy PDF file using GPL Ghostscript. File types are identified by file name extensions only, and are mapped to Open Office filters in the configuration file.
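
To make the extension-to-filter mapping concrete, here is a small self-contained sketch. The class FilterLookup, its methods, and the hard-coded extension lists are assumptions for illustration only; in the actual program the lists come from the configuration file. The filter names are the ones used by the conversion code in this article.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper; the extension lists below are illustrative only. */
final class FilterLookup {
    static final List<String> WRITER = Arrays.asList(".doc", ".txt", ".odt");
    static final List<String> CALC = Arrays.asList(".xls", ".ods");
    static final List<String> DRAW = Arrays.asList(".ppt", ".jpg", ".gif");

    /** Returns the lower-case extension including the leading dot, or "" if none. */
    static String extensionOf(String fileName) {
        int lastDot = fileName.lastIndexOf('.');
        int lastSep = Math.max(fileName.lastIndexOf('/'), fileName.lastIndexOf('\\'));
        if (lastDot <= lastSep) {
            return "";   // no dot at all, or the last dot belongs to a directory name
        }
        return fileName.substring(lastDot).toLowerCase(Locale.ROOT);
    }

    /** Returns the PDF export filter name for an extension, or null if unknown. */
    static String filterFor(String ext) {
        if (WRITER.contains(ext)) return "writer_pdf_Export";
        if (CALC.contains(ext)) return "calc_pdf_Export";
        if (DRAW.contains(ext)) return "draw_pdf_Export";
        return null;
    }
}
```

A caller would treat a null result as an unknown file type and skip the file, which mirrors what the conversion loop does.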

PDF Conversion using Open Office

For the benefit of those readers cutting and pasting as they read, the following imports are required.

Java
import com.sun.star.beans.PropertyValue;
import com.sun.star.uno.XComponentContext;
import com.sun.star.uno.UnoRuntime;
import com.sun.star.lang.XMultiComponentFactory;
import com.sun.star.frame.XComponentLoader;
import com.sun.star.frame.XStorable;
import com.sun.star.io.IOException;
import com.sun.star.util.XCloseable;
import com.sun.star.lang.XComponent;
import ooo.connector.BootstrapSocketConnector;

The conversion method has the signature public boolean DoConvert(String[] inputFiles) where the array inputFiles contains a list of input files given on the command line adjusted to include their full absolute paths. The first step is to establish a connection to Open Office. Before actually running the program, make sure that Open Office is installed correctly on the system, and verify that a document can be created using Open Office by you or the intended user. Make sure to click through any "first time use" registration screens before using the program.

The classes used to talk to Open Office are:

Java
// Declare Open Office components
XComponentContext xContext = null;
XMultiComponentFactory xMCF = null;
XComponentLoader xComponentLoader = null;
XStorable xStorable = null;
XCloseable xCloseable = null;
Object desktop = null;
Object document = null;

Details about each of the above components can be found in the Open Office SDK Documentation at documentation.openoffice.org. For our purposes, it suffices to say that these objects encapsulate the API we need to talk to Open Office. In the next code snippet, the BootstrapSocketConnector is used to connect to Open Office. Open Office will be running in another process and listening on a socket, and the BootstrapSocketConnector takes care of the connection details. The BootstrapSocketConnector is in a jarfile included with the source code with kind permission of Hol.sten, a regular contributor in the Open Office forums. There are many excellent resources on connection alternatives in the Open Office forums, but unless more than one instance of Open Office is going to be running, an Open Office instance on another server is being used, or there are other fancy requirements, nothing more than the bootstrap connector should be needed.

To follow the next code snippets, keep in mind that conversion status is kept in two properties on the PDFConvert object: StatusText and IsError. To keep track of non-fatal errors and status during loop processing, messages are further accumulated in a StringBuffer buf and stuffed into the StatusText at the end.

Java
// Try to get reference to an Open Office process
try {
    // Should use OO installation lib/programs directory on your system
    String ooLibFolder = ooLibPath;

    // Load the Open Office context
    xContext = BootstrapSocketConnector.bootstrap(ooLibFolder);

    // Load the Open Office object factory
    xMCF = xContext.getServiceManager();

    // Get a desktop instance
    desktop = xMCF.createInstanceWithContext(
        "com.sun.star.frame.Desktop", xContext);

    // Get a reference to the desktop interface that can load files
    xComponentLoader = (XComponentLoader) UnoRuntime.queryInterface(
        XComponentLoader.class, desktop);

} catch (Exception ex) {
    // Open Office error
    statusText = "Could not get usable OpenOffice: " + ex.toString();
    isError = true;
    return false;
}

An exception thrown here is almost certainly due to misinstallation or misconfiguration of Open Office or the connection to Open Office. There's really nothing for the program to do here but report back that it couldn't connect to Open Office, and then Open Office can be reconfigured/reinstalled/rebooted, and you can try again. In order to debug Open Office connection problems, again make sure that the intended user can run Open Office offline and open the intended files without errors or popup registration windows.

The next bits of code are in the loop processing input files and also appear in try {...} catch {...} finally {...} blocks. In the following snippet, the file extension is in the String variable ext, and the input file names are in the array inputFiles indexed by i. When opening the file, we request that Open Office not open a document window since we are running in batch mode.

Java
// Set the document opener to not display an OO window
PropertyValue[] loaderValues = new PropertyValue[1];
loaderValues[0] = new PropertyValue();
loaderValues[0].Name = "Hidden";
loaderValues[0].Value = Boolean.TRUE;

Then we convert the current file name to the URI format required by Open Office.

Java
// Convert file path to URL name format and escape spaces
String docURL = "file:///" + inputFiles[i]
    .replace(File.separatorChar, '/')
    .replace(" ", "%20");
lastDot = docURL.lastIndexOf('.');
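
The snippet above escapes only spaces. As a hedged sketch of how other special characters might be handled, the following helper percent-encodes every byte outside a small safe set. The class FileUrls and the method toFileUrl are hypothetical and not part of the article's code, and the separator is passed in explicitly only so the example behaves the same on any platform.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for building file URLs; not part of the article's code. */
final class FileUrls {
    // Characters copied through unchanged; everything else is percent-encoded
    private static final String SAFE =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~/:";

    static String toFileUrl(String absolutePath, char separator) {
        String path = absolutePath.replace(separator, '/');
        if (!path.startsWith("/")) {
            path = "/" + path;               // e.g. C:/docs becomes /C:/docs
        }
        StringBuilder url = new StringBuilder("file://");
        for (byte b : path.getBytes(StandardCharsets.UTF_8)) {
            char c = (char) (b & 0xFF);
            if (SAFE.indexOf(c) >= 0) {
                url.append(c);
            } else {
                url.append(String.format("%%%02X", b & 0xFF));
            }
        }
        return url.toString();
    }
}
```

With this sketch, a path such as /tmp/my file#1.doc becomes file:///tmp/my%20file%231.doc, and the last dot in the URL still marks the start of the extension.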

Next, the document is opened and an interface is obtained that can store the file using a filter appropriate to the kind of file it is. The program will then save the results to a file of the same name, but with the PDF extension, and append the filename to a converted file list. If the file is one of the "native" types, for example if it is already a PDF file, it just gets added directly to the list.

Java
// If it is already PDF, add it to the list of "converted" files
if (StringArrayContains(nativeTypes, ext)) {
    convertedFiles.add(docURL);
} else {
    // Open the document in Open Office
    document = xComponentLoader.loadComponentFromURL(
        docURL, "_blank", 0, loaderValues);

    // Get a reference to the document interface that can store files
    xStorable = (XStorable) UnoRuntime.queryInterface(
        XStorable.class, document);

    // Set the arguments to save to PDF.
    PropertyValue[] saveArgs = new PropertyValue[2];
    saveArgs[0] = new PropertyValue();
    saveArgs[0].Name = "Overwrite";
    saveArgs[0].Value = Boolean.TRUE;

    // Choose appropriate output filter
    saveArgs[1] = new PropertyValue();
    saveArgs[1].Name = "FilterName";
    if (StringArrayContains(writerTypes, ext)) {
        saveArgs[1].Value = "writer_pdf_Export";
    } else if (StringArrayContains(calcTypes, ext)) {
        saveArgs[1].Value = "calc_pdf_Export";
    } else if (StringArrayContains(drawTypes, ext)) {
        saveArgs[1].Value = "draw_pdf_Export";
    } else {
        buf.append("File " + i + " has unknown extension: " + ext);
        isError = true;
        continue;  // Skip to the next file
    }

    // The converted file will have the same name with a PDF extension
    String sSaveUrl = docURL.substring(0, lastDot) + ".pdf";

    // Save the file
    xStorable.storeToURL(sSaveUrl, saveArgs);

    // Remember the converted file for the optional merge step
    convertedFiles.add(sSaveUrl);
}

Various exceptions are handled, but it is always important to close a file when done, so the code to do that goes into a finally block. Following recommended practice, the XCloseable interface is used to close the document. However, XCloseable may be null, or XCloseable.close() may throw an exception. In those cases, the XComponent interface is used to explicitly dispose of the object. The issue is that multiple clients may be using the same document. For example, if a document is being queued up for printing or a modal dialog is open somewhere, Open Office may throw a CloseVetoException on attempts to close the document. That situation does not arise in this use case, so the document is just disposed.

Java
finally {
    // Make sure the file is closed before going to the next one
    if (document != null) {
        // Get a reference to the document interface that can close a file
        xCloseable = (XCloseable) UnoRuntime.queryInterface(
            XCloseable.class, document);

        // Try to close it or explicitly dispose it
        // See http://doc.services.openoffice.org/wiki/Documentation/
        //          DevGuide/OfficeDev/Closing_Documents
        if (xCloseable != null) {
            try {
                xCloseable.close(false);
            } catch (com.sun.star.util.CloseVetoException ex) {
                XComponent xComp = (XComponent) UnoRuntime.queryInterface(
                    XComponent.class, document);
                xComp.dispose();
            }
        } else {
            XComponent xComp = (XComponent) UnoRuntime.queryInterface(
                XComponent.class, document);
            xComp.dispose();
        }
    }
    document = null;   // Javanauts, please pardon my CSharpery
}

Merging Using Ghostscript

The converted files are kept in the list convertedFiles. To merge these, a command line is constructed to pass to the OS, so the file names have to be converted back to the native OS format. Note that it is OK to feed bare PostScript files to GPL Ghostscript as well as PDF files, so PostScript was included as a "native" file type. Because of the shenanigans surrounding how spaces in pathnames are handled between Windows and Unix, the code is not reproduced here.
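
For readers who want something to start from, one way to sidestep the quoting differences is to build the command as a list of separate arguments and hand that list to java.lang.ProcessBuilder, which passes each element as its own argument. The sketch below is an assumption about how that could look, not the article's code: the class GsCommand and its parameters are hypothetical. The options -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite are the ones named in the requirements section, and -sOutputFile= is the standard Ghostscript switch for naming the output file.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical builder for the Ghostscript merge command line. */
final class GsCommand {
    static List<String> build(String gsExecutable, String mergeFile,
                              List<String> pdfFiles) {
        List<String> cmd = new ArrayList<String>();
        cmd.add(gsExecutable);               // e.g. gs, or gswin32c.exe on Windows
        cmd.add("-q");
        cmd.add("-dNOPAUSE");
        cmd.add("-dBATCH");
        cmd.add("-sDEVICE=pdfwrite");
        cmd.add("-sOutputFile=" + mergeFile);
        cmd.addAll(pdfFiles);                // inputs are merged in the order given
        return cmd;
    }
}
```

Because every file name stays in its own list element, a name containing spaces needs no quotes or backslashes when the list is given to new ProcessBuilder(cmd).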

Then a process is opened with the constructed command line and the program waits for it to finish. Note that the output and error streams need to be handled in order for the child process to finish cleanly. Though there may be more elegant means to do it, for this use case those streams are just closed and only the exit code is retained.

Java
try {
    // Execute the command
    Process mProc = Runtime.getRuntime().exec(cmd.toString());

    // Voodoo - In order to wait for an external process, you
    // have to handle its stdout (getInputStream) and stderr (getErrorStream)
    // I'm just going to close them as I'm only interested in if it succeeded or not
    InputStream iStr = mProc.getInputStream();
    iStr.close();
    InputStream eStr = mProc.getErrorStream();
    eStr.close();

    // Now wait
    int exCode = mProc.waitFor();
    if (exCode == 0) {
        buf.append("Merge succeeded: exit code was zero.");
    } else {
        isError = true;
        buf.append("Merge failed: exit code was " + exCode);
    }

} catch (java.io.IOException ex) {
    buf.append("Merge failed: " + ex.toString());
    isError = true;
    statusText = buf.toString();
    return false;
} catch (java.lang.InterruptedException ex) {
    buf.append("Merge interrupted: " + ex.toString());
    isError = true;
    statusText = buf.toString();
    return false;
}
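
As an alternative to closing the streams, the child's output can be read to the end before waiting, which also keeps any Ghostscript messages available for the status text. The following is a minimal sketch under that assumption and is not the article's code; ProcessRunner, Result, run and drain are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.List;

/** Hypothetical helper that runs a command and collects its output. */
final class ProcessRunner {
    /** Exit code and combined stdout/stderr text of a finished process. */
    static final class Result {
        final int exitCode;
        final String output;
        Result(int exitCode, String output) {
            this.exitCode = exitCode;
            this.output = output;
        }
    }

    /** Reads a stream to the end so the writer on the other side never blocks. */
    static String drain(InputStream in) throws IOException {
        StringBuilder text = new StringBuilder();
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
        } finally {
            reader.close();
        }
        return text.toString();
    }

    static Result run(List<String> command)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true);        // fold stderr into stdout
        Process proc = pb.start();
        proc.getOutputStream().close();      // the child gets EOF on its stdin
        String output = drain(proc.getInputStream());
        return new Result(proc.waitFor(), output);
    }
}
```

Merging stderr into stdout means only one stream has to be read, so no second thread is needed to keep the child from stalling on a full pipe.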

Note that Ghostscript is a venerable and well maintained program. If the program fails at this step, then it is again almost certainly because GPL Ghostscript was misinstalled or misconfigured or because an input file was bad. As with Open Office, this can be checked offline by explicitly running GPL Ghostscript in a command shell on the suspect input, so there is no need to reproduce the exact error messages.

Other Program Details

The -d feature causes deletion of input or intermediate files on completion. This is useful for backroom processing situations where you don't want clutter to accumulate or you don't want to worry about sensitive information being left around inadvertently. For details of this and the other features described in the usage text, please refer to the Main source code.

Requirements and Compilation Details

Open Office

Open Office can be obtained from www.openoffice.org. This program has been tested with Open Office versions 2.3.0 on Windows XP, 2.4.0 on Fedora Core 7, and 2.4.0 on Ubuntu 8.04. Previous versions of Open Office have the ability to create PDF, and as long as some flavor of Open Office version 2.X is being used, the connection mechanism used in this code should be supported. However, this code has not been built or tested against previous Open Office versions. (See threads referenced in this thread at the Open Office forums site for more information.)

GPL Ghostscript

GPL Ghostscript can be obtained at pages.cs.wisc.edu/~ghost. This code has been tested with Ghostscript versions 8.61 on Windows, 8.62 on Fedora Core 7, and 8.62 on Ubuntu 8.04. Previous versions of GPL Ghostscript should work as long as they support the command line options -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite. Note: ESP Ghostscript version 8.15, which comes standard with some flavors of Linux, did not work in this capacity even though it supports the above options. See further discussion below.

bootstrapconnector.jar

The bootstrapconnector.jar is a simplified mechanism for connecting to an Open Office instance created by Hol.sten at the Open Office developers' forums. It is included here with the source code for convenience, but is ultimately available from this thread.

Setting up your Build Environment

The code was compiled and tested using Sun JDK 1.6 on Fedora Core 7, Ubuntu 8.04, and Windows XP SP2. It will neither compile nor run with versions of GCJ that are not Java 1.5 compliant. Make sure that the CLASSPATH or IDE build environment includes the Open Office jarfiles juh.jar, jurt.jar, jut.jar, ridl.jar, unoil.jar, and officebean.jar. Usually, these jarfiles can be found in the program/classes subdirectory of your Open Office installation. And of course, bootstrapconnector.jar must be included on the CLASSPATH.

Configuration

There are several configuration constants that need to be set for your system in the config.properties file.

  • ooLibPath is the program directory of your Open Office installation; a folder named classes should be inside it. (E.g. this is C:\Program Files\OpenOffice.org 2.3\program on a Windows system with a standard 2.3.0 install.)
  • gsExePath is the directory of the GPL Ghostscript executable. (E.g. This is C:\Program Files\gs\gs8.61\bin on a Windows system with a standard install.)
  • gsExeName is the name of the Ghostscript executable. (On Windows, it has to be the one of the two shipped executables that runs in a console window, which is gswin32c.exe. On UNIX systems, it is usually just gs.)
  • shellCommandStyle can choose between one of two hacks for building the GPL Ghostscript command line. doubleQuoted will cause it to be built to handle spaces in filenames and paths using double quotes appropriate for Microsoft Windows, and escapeSpaces will cause it to be built to handle spaces Unix style with backslashes.
  • writerTypes is a list of file extensions, comma separated in lower case including the leading '.', that are expected to be handled by Open Office Writer.
  • calcTypes is a list of file extensions, comma separated in lower case including the leading '.', that are expected to be handled by Open Office Calc.
  • drawTypes is a list of file extensions, comma separated in lower case including the leading '.', that are expected to be handled by Open Office Draw. This category seems to include presentation types for the moment, so there is no separate category for Open Office Impress.
  • nativeTypes is a list of file extensions, comma separated in lower case including the leading '.', that are either already PDF or can be handled directly by GPL Ghostscript, such as PostScript.

There are probably many more file types that could be listed, but these are the popular ones.
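
As a rough illustration of how such a file could be read, the sketch below loads the properties with java.util.Properties and splits a comma-separated type list. The class ConfigTypes and its methods are hypothetical, and the program's real loading code may differ.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Properties;

/** Hypothetical reader for the type lists in config.properties. */
final class ConfigTypes {
    static Properties load(Reader source) throws IOException {
        Properties props = new Properties();
        props.load(source);
        return props;
    }

    /** Splits a comma-separated extension list, trimming and lower-casing each entry. */
    static List<String> typeList(Properties props, String key) {
        List<String> types = new ArrayList<String>();
        String raw = props.getProperty(key, "");
        for (String item : raw.split(",")) {
            String ext = item.trim().toLowerCase(Locale.ROOT);
            if (!ext.isEmpty()) {
                types.add(ext);
            }
        }
        return types;
    }
}
```

One thing to keep in mind with the properties format is that a backslash is an escape character, so a Windows path in such a file has to be written with doubled backslashes or with forward slashes.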

Discussion

As noted above, the merge process fails when ESP Ghostscript 8.15 is used. The symptom is that for several test files, ESP Ghostscript begins to peg the CPU at 100% usage and makes very slow progress creating output, only a few bytes per second on a 2.7 GHz dual core machine. The GPL Ghostscript codebase was merged with ESP Ghostscript 8.15 at GPL version 8.57, and so GPL 8.61 can be regarded as an updated product anyway.

Beware of the difference between Windows and Unix style line endings when processing plain text files. Although there are no problems within a single platform, text files created on Windows with the CRLF line endings will try to open as Calc files using Open Office on Unix. In interactive mode, this has the further joyful effect of popping a dialog box to inquire about import parameters, and in batch mode it causes the XStorable interface above to be null during file processing.
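
One possible workaround, offered here only as an untested assumption, is to normalize line endings in plain text inputs before handing them to Open Office on Unix. The class LineEndings and its methods are hypothetical, and whether this actually avoids the Calc detection was not verified for this article.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical pre-processing step for plain text inputs. */
final class LineEndings {
    /** Converts CRLF and lone CR line endings to LF. */
    static String toUnix(String text) {
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    /** Rewrites a text file in place with LF line endings. */
    static void normalizeFile(Path file, Charset charset) throws IOException {
        String text = new String(Files.readAllBytes(file), charset);
        Files.write(file, toUnix(text).getBytes(charset));
    }
}
```

The in-place rewrite modifies the input file, so it would only be appropriate on a working copy.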

A few corners were cut in this example, although the program is being used in production for backroom processing. In particular, for general use:

  • The handling of filenames should be tightened up to account for other special characters and whitespace.
  • The list of handled file types should be expanded and tested.
  • The error handling and reporting should be more informative.

However, the ROI on this code is at the point of diminishing returns. The program itself is just glue and the true usefulness lies in Open Office and Ghostscript.

History

  • 1st June, 2008: Initial post

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)