Merging PDF Documents: How to Ensure Accuracy and Efficiency

12 Sep 2018
In this article, we will discuss merging PDF documents at the enterprise level. We will cover the top advantages and the common pitfalls of merging PDF documents, and key tips on how to improve your merge process.

This article is in the Product Showcase section for our sponsors at CodeProject. These articles are intended to provide you with information on products and services that we consider useful and of value to developers.

Why Merge PDFs?

In today’s enterprise workflows, PDF files are commonly merged to streamline a wide variety of tasks. Merged documents include mortgage documents, loans, invoices, credit card statements, cell phone bills, HR documents (such as healthcare benefit information and 401(k) statements), and many more. A familiar example is a bank or credit card statement. Specific sections of each statement are created from a template that is filled with an individual’s personal information. The resulting personalized statement is then merged with a standard disclaimer section and the financial institution’s header page, which may also include its logo and other corporate templates.

Example of a Merged Statement

[Source: http://resumepdf.com/6-chase-bank-statements/]

Companies often need to merge many documents and different data sources into a single PDF document, which may involve multiple merging steps. Consider the statement above: each statement is created by merging different documents together. Now imagine that the user requests a one-year history of their statements. The financial institution can either provide individual files or create a one-year report by merging the relevant parts of all the user’s monthly statements into one master document. The same is true when financial institutions prepare customer statements for printing and mailing: individual client statements are combined into large print runs that are tens of thousands of pages long.

Merging PDF documents is a mission-critical process in the enterprise world today. Many businesses and organizations rely on merging to streamline workflows that would otherwise require far more cost and resources to produce the same output manually. Because of this, it’s extremely important to ensure that merging is done accurately.

Document Merging Pitfalls – What You Need to be Aware Of

Merging PDF documents is an indispensable process, and businesses rely so heavily on it because it saves them time, effort, and money. However, if the tools being used are not built to perform at the enterprise level, things can often go wrong, and the end results might not be what you expect.

Excessively Large PDF Files

Document Merged with Adobe PDF Library

Document Merged with Another PDF Tool

As you may have guessed, one of the most common problems with PDF document merging is excessive file size. The example PDF shown above was created by merging 100 one-page files into a single combined PDF using two separate tools: the Adobe PDF Library and another PDF tool. The PDF merged with the other tool is almost 30 times larger.

The easiest way to merge two PDF files (and this is simplifying things a bit) is to fuse them together: the contents of File B are added after the contents of File A, and the size of the resulting file is roughly the sum of the sizes of A and B. However, this doesn’t have to be the case. In a proper PDF file merge, the resulting file can be smaller than A + B whenever the two files share resources. This is because PDF files have a complex internal structure. PDFs are typically composed of text runs and resources such as fonts, images, color spaces, etc.

When merging files, a well-built merge application examines each resource, determines whether the merged result would contain duplicates, and automatically eliminates them. For example, if there are two copies of the same font, the second copy (at least in concept) should be eliminated. This also applies to images and other resources.

Companies often use their logo on statements and other consumer-facing content. If we merge two statements and each one uses the same logo, the logo is a shared document resource. Applications that just fuse the documents together leave multiple copies of the same logo within the document, leading to excessive file size for the resulting “merged” PDF. The merging application needs to make sure the resulting document contains only one copy of the logo, and that each instance points to it. The Adobe PDF Library takes care of this scenario behind the scenes. You too can achieve dramatically smaller merged PDF files by downloading a free evaluation of the Adobe PDF Library from our website and implementing it in your merging process.
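To make the idea of eliminating duplicate resources concrete, here is a minimal sketch that is independent of any PDF library. It assumes a simplified model in which a resource is just a name plus its raw bytes; the `Resource` and `ResourceDeduplicator` types are hypothetical and are not part of the Adobe PDF Library. Resources with identical bytes are stored once, and a remap table records which stored copy each original reference should point to.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

// Hypothetical, simplified model: a resource is a name plus its raw bytes.
public sealed class Resource
{
    public string Name { get; }
    public byte[] Data { get; }
    public Resource(string name, byte[] data) { Name = name; Data = data; }
}

public static class ResourceDeduplicator
{
    // Returns the unique resources and, for each input position, the index
    // of the unique resource that the corresponding reference should use.
    public static (List<Resource> Unique, int[] Remap) Deduplicate(IReadOnlyList<Resource> resources)
    {
        var unique = new List<Resource>();
        var indexByHash = new Dictionary<string, int>();
        var remap = new int[resources.Count];

        using (var sha = SHA256.Create())
        {
            for (int i = 0; i < resources.Count; i++)
            {
                // Identical bytes produce identical digests, so the digest is the lookup key.
                string key = Convert.ToBase64String(sha.ComputeHash(resources[i].Data));
                if (!indexByHash.TryGetValue(key, out int index))
                {
                    index = unique.Count;      // first time we see this content
                    unique.Add(resources[i]);
                    indexByHash[key] = index;
                }
                remap[i] = index;              // every duplicate points at the first copy
            }
        }
        return (unique, remap);
    }
}
```

If two statements each embed the same logo, `Deduplicate` keeps one copy and maps both references to it. A production tool has more to do than this sketch: it would typically confirm a hash match with a byte-for-byte comparison, and it must also recognize resources that are equivalent without being byte-identical.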

Inefficient PDF Data Structure

Documents created through an incorrectly executed merge process often have performance problems. Excessive file size, discussed above, is one of the primary causes. Another major cause is an inefficient data structure in the merged PDF. A proper merge process needs to optimize the structure of the page tree so that the resulting PDF has efficient page access. Without it, users may experience noticeable delays when navigating the pages of the resulting document.
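As an illustration of what an optimized page tree can look like, the sketch below builds a balanced tree over a given number of pages with a fixed maximum number of children per node. This is a generic balancing approach under assumed names (`PageTreeNode`, `PageTreeBuilder`, `maxKids`); it is not a description of how the Adobe PDF Library arranges its page tree internally.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical node type: a leaf represents one page, an inner node groups its kids.
public sealed class PageTreeNode
{
    public int PageCount;                                        // pages beneath this node
    public List<PageTreeNode> Kids = new List<PageTreeNode>();   // empty for a leaf page
}

public static class PageTreeBuilder
{
    // Builds the tree bottom-up: group nodes into parents of at most maxKids
    // children, and repeat until a single root remains.
    public static PageTreeNode BuildBalanced(int pageCount, int maxKids)
    {
        if (pageCount < 1) throw new ArgumentOutOfRangeException(nameof(pageCount));
        if (maxKids < 2) throw new ArgumentOutOfRangeException(nameof(maxKids));

        var level = new List<PageTreeNode>();
        for (int i = 0; i < pageCount; i++) level.Add(new PageTreeNode { PageCount = 1 });

        do
        {
            var parents = new List<PageTreeNode>();
            for (int i = 0; i < level.Count; i += maxKids)
            {
                var parent = new PageTreeNode();
                for (int j = i; j < Math.Min(i + maxKids, level.Count); j++)
                {
                    parent.Kids.Add(level[j]);
                    parent.PageCount += level[j].PageCount;
                }
                parents.Add(parent);
            }
            level = parents;
        } while (level.Count > 1);

        return level[0];
    }

    // Number of inner-node levels between the root and the deepest page.
    public static int Depth(PageTreeNode node)
    {
        if (node.Kids.Count == 0) return 0;
        int deepest = 0;
        foreach (var kid in node.Kids) deepest = Math.Max(deepest, Depth(kid));
        return deepest + 1;
    }
}
```

With `maxKids` set to 10, a 100-page document needs only two levels of inner nodes, so a viewer reaches any page after inspecting at most two short child lists. A merge that nests each source document's tree inside the previous one instead produces a tree whose depth grows with the number of merged files.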

Removing Elements that Should Not be Removed

Another large group of merge problems stems from incorrect removal of duplicate resources and improper cleanup of the resulting document. Eliminating duplicate resources while keeping every reference within a PDF document correct is not an easy task. Some tools remove duplicate resources incorrectly and may end up breaking the content altogether. Here are some common problems that can result in data loss:

  • Missing fonts
  • Missing images
  • Bookmarks improperly combined
  • PDF Tags and structure problems
  • Metadata loss
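Several of the problems listed above come down to references that no longer resolve after cleanup. The following sketch shows one way such a check could look, assuming a simplified model in which objects are identified by integer IDs; the `ReferenceChecker` helper is hypothetical and only illustrates the idea of validating a merged file before it is delivered.

```csharp
using System.Collections.Generic;

public static class ReferenceChecker
{
    // Returns every referenced ID that has no matching object in the merged file.
    // objectIds: IDs of the objects that survived the merge and cleanup.
    // references: IDs that pages and other objects still point to.
    public static List<int> FindDanglingReferences(ISet<int> objectIds, IEnumerable<int> references)
    {
        var dangling = new List<int>();
        foreach (int id in references)
        {
            if (!objectIds.Contains(id)) dangling.Add(id);
        }
        return dangling;
    }
}
```

An empty result means every reference still has a target. A non-empty result would indicate, for example, a page that points to a font or image that was removed as a supposed duplicate.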

Font Subsets and Merging of Other Elements

Another common operation in document merging is combining existing font subsets. Font subsets are usually created during document creation or optimization. Font subsetting is a complex operation that removes unused glyphs from a document font, creating a new font that contains only the glyphs actually used in the document. This reduces document editability but can greatly reduce file size. Imagine a font with 1,000 glyphs embedded as a resource in a PDF document. If the document uses only 10 of them, there is no need to keep the complete font in the document.
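The first step of subsetting, deciding what the subset must contain, can be sketched in a few lines. This example assumes that each character maps to exactly one glyph, which is a simplification (real fonts also involve ligatures, encodings, and composite glyphs); `SubsetPlanner` is a hypothetical helper, not a library API.

```csharp
using System.Collections.Generic;

public static class SubsetPlanner
{
    // Returns the distinct characters that a subset font would have to cover,
    // given all text runs that use the font.
    public static SortedSet<char> UsedCharacters(IEnumerable<string> textRuns)
    {
        var used = new SortedSet<char>();
        foreach (string run in textRuns)
        {
            foreach (char c in run) used.Add(c);
        }
        return used;
    }
}
```

Comparing the size of the returned set with the number of glyphs in the full font gives a rough idea of how much a subset can save.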

In the example below, the merged PDF created with the Adobe PDF Library contains only the fonts that are needed, while the merged PDF created with another PDF tool contains extraneous fonts. Specifically, the other tool’s output contains a copy of each font from the original 100 pages. Since these are all the exact same font, the Adobe PDF Library merged them, resulting in only 1 font instead of 100 in the final merged document.

Document Merged with Adobe PDF Library

Document Merged with Another PDF Tool

When merging documents, existing font subsets also need to be merged. This can be tricky for some applications. To merge the fonts, the application needs to build the union of all glyphs used across the merged documents and create a new font subset based on that. This is potentially a very error-prone process that the Adobe PDF Library handles with ease.
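The union step can be sketched as follows. The example relies on the PDF convention that a subset font's name starts with a six-letter tag and a plus sign (for example `ABCDEF+Helvetica`), and it assumes that subsets sharing a base name come from the same font program and that characters stand in for glyphs. Both assumptions are simplifications, and `FontSubsetMerger` is a hypothetical helper rather than the library's own implementation.

```csharp
using System.Collections.Generic;

public static class FontSubsetMerger
{
    // Strips a subset tag such as "ABCDEF+" from a font name, if present.
    public static string BaseFontName(string fontName)
    {
        int plus = fontName.IndexOf('+');
        return plus == 6 ? fontName.Substring(7) : fontName;
    }

    // Input: pairs of (font name, characters used by that subset).
    // Output: for each base font, the union of characters across all its subsets.
    public static Dictionary<string, SortedSet<char>> Merge(
        IEnumerable<KeyValuePair<string, string>> subsets)
    {
        var merged = new Dictionary<string, SortedSet<char>>();
        foreach (var subset in subsets)
        {
            string baseName = BaseFontName(subset.Key);
            if (!merged.TryGetValue(baseName, out SortedSet<char> glyphs))
            {
                glyphs = new SortedSet<char>();
                merged[baseName] = glyphs;
            }
            glyphs.UnionWith(subset.Value);   // a string enumerates its characters
        }
        return merged;
    }
}
```

The hard part that this sketch leaves out is what makes the process error-prone in practice: confirming that two subsets really derive from the same font, rebuilding the font program with the combined glyphs, and rewriting every text run so that its character codes match the new subset.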

When merging documents, bookmarks and tables of contents (TOCs) also need to be merged. It’s up to the tool performing the merge to determine how they behave in the new document. Non-enterprise-grade tools might just ignore and discard these elements. A more sophisticated tool like the Adobe PDF Library will go through the bookmarks and TOC entries to make sure they point to the correct pages, even after the page numbers have changed.
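The page renumbering described here boils down to shifting each destination by the number of pages that now precede it. Below is a minimal sketch under the assumption that a bookmark is just a title and a zero-based page index; the `Bookmark` and `BookmarkMerger` types are hypothetical and ignore nesting, named destinations, and view settings.

```csharp
using System.Collections.Generic;

// Hypothetical, flattened bookmark: a title and a zero-based destination page.
public sealed class Bookmark
{
    public string Title { get; }
    public int PageIndex { get; }
    public Bookmark(string title, int pageIndex) { Title = title; PageIndex = pageIndex; }
}

public static class BookmarkMerger
{
    // Appends the second document's bookmarks to the first document's,
    // shifting each destination by the page count of the first document.
    public static List<Bookmark> Append(
        IEnumerable<Bookmark> first, int firstPageCount, IEnumerable<Bookmark> second)
    {
        var result = new List<Bookmark>(first);
        foreach (Bookmark b in second)
        {
            result.Add(new Bookmark(b.Title, b.PageIndex + firstPageCount));
        }
        return result;
    }
}
```

For example, if the first statement has three pages, a bookmark that pointed to page index 0 of the second statement must point to page index 3 in the merged file.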

Document metadata is another tricky area of merging PDF documents. When merging multiple PDFs, the tool needs to decide what to do with the metadata. Some tools simply discard it. Others pick an arbitrary rule: for example, they retain the author, title, subject, and keywords metadata from Document A, apply it to the final product, and discard that of Document B. The Adobe PDF Library lets users choose how they would prefer to manage their document metadata.
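One way to make the metadata decision explicit rather than arbitrary is to express it as a named policy. The sketch below models metadata as string key/value pairs; the `MetadataPolicy` enum and `MetadataMerger` class are illustrative assumptions and do not reflect the options that the Adobe PDF Library actually exposes.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical policies for combining the metadata of two documents.
public enum MetadataPolicy { KeepFirst, KeepSecond, PreferFirstFillFromSecond }

public static class MetadataMerger
{
    public static Dictionary<string, string> Merge(
        IReadOnlyDictionary<string, string> first,
        IReadOnlyDictionary<string, string> second,
        MetadataPolicy policy)
    {
        switch (policy)
        {
            case MetadataPolicy.KeepFirst:
                return Copy(first);
            case MetadataPolicy.KeepSecond:
                return Copy(second);
            case MetadataPolicy.PreferFirstFillFromSecond:
            {
                // Start from the second document, then let the first overwrite shared keys.
                Dictionary<string, string> merged = Copy(second);
                foreach (var pair in first) merged[pair.Key] = pair.Value;
                return merged;
            }
            default:
                throw new ArgumentOutOfRangeException(nameof(policy));
        }
    }

    private static Dictionary<string, string> Copy(IReadOnlyDictionary<string, string> source)
    {
        var copy = new Dictionary<string, string>();
        foreach (var pair in source) copy[pair.Key] = pair.Value;
        return copy;
    }
}
```

`KeepFirst` corresponds to the behavior described above, where Document A's metadata wins and Document B's is discarded. `PreferFirstFillFromSecond` keeps fields that exist only in Document B, which avoids silent metadata loss.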

Merging Files Using the Adobe PDF Library

using Datalogics.PDFL; // Assumed namespace of the library's .NET interface

class MergeDocuments
{
    static void Main(string[] args)
    {
        // The Library object must stay alive while any PDF objects are in use
        using (Library lib = new Library())
        // Open the two documents we are going to merge
        using (Document doc1 = new Document("document1.pdf"))
        using (Document doc2 = new Document("document2.pdf"))
        {
            // This is the line that performs the document merge: all pages of
            // doc2 are inserted after the last page of doc1.
            // PageInsertFlags controls how the documents are merged and optimized;
            // All enables every option.
            doc1.InsertPages(Document.LastPage, doc2, 0, Document.AllPages, PageInsertFlags.All);

            // Save the merged document to disk
            doc1.Save(SaveFlags.Full, "document_merged.pdf");
        }
    }
}

The code above combines font subsets, removes duplicate images and resources, and reworks bookmarks and the TOC to ensure links work properly, all behind the scenes. Unlike many other tools, the Adobe PDF Library gives users full control of these processes through the document merge flags passed to the InsertPages() method. The application above preserves the metadata from Document 1, but the Adobe PDF Library provides more control in this area if necessary.

Additional samples are available on GitHub.

Summary and Best Practices

As we’ve discussed, to ensure that you implement the most accurate PDF document merging process, you should educate yourself on the full capabilities of the tool you select for merging PDFs. When selecting your PDF solutions partner, focus on their knowledge and experience in the PDF community, their ability to implement intuitive solutions for common problems, and their ability to provide expert advice on complex PDF workflow challenges.

Another recommendation is to always make sure that the PDF tool you use is built for enterprise-level implementation and is fully backed by a dedicated support team that is experienced and knowledgeable. Built with the same core technology that Adobe uses to build Acrobat, the Adobe PDF Library is an SDK designed to edit, assemble, and optimize your PDF documents, ensuring that you provide the best possible file for your audience and users. The Adobe PDF Library is available as a free evaluation download from our website.

To help you get through your next PDF projects pain-free, click here to download a guide that highlights four best practices you should be aware of to achieve the optimal PDF document.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.