Generate Thumbnail Images from PDF Documents

Jonathan Hodgson

11 May 2004

This article presents VB.NET code to create thumbnail images from a directory of Adobe Acrobat PDF documents using the .NET Framework.

Thumbnail images

Introduction

This article presents VB.NET code to create thumbnail images from a directory of Adobe Acrobat PDF documents.

Often when looking for documents it is much easier to find what you want visually, for example by seeing the cover of a document.

The application was written for a website that I was developing that needed to display links to PDF documents. Instead of just showing a little PDF icon next to each document we wanted to display the front page of the actual document.

As shown below, this makes the listings look better and also lets users find a document more quickly if they recognise its cover.

PDF icons vs. custom icons

Note: please ignore the strange text; lorem ipsum is simply dummy text for this example.

Hopefully people will agree that having the actual front cover displayed next to the hyperlink works better than the generic PDF icon.

Background

The web site was a Content Management System (CMS) so new PDF documents were uploaded to the site by the users. We then had this application scheduled as a batch service to run every 5 minutes and check for new files.

In the backend system the documents have metadata stored in a SQL Server 2000 database. We would then write a flag to say the thumbnail had been created, and when we generated the HTML content for the page request in ASP/ASP.NET we would return the appropriate IMG tag and source.

Using the Acrobat SDK also meant we could programmatically read the PDF metadata and retrieve the number of pages in the document, which could then be displayed as well. Although the end users could have entered that information, it meant less work for them and a better overall impression of the web site. Another advantage was that many users relied on the number of pages to determine how large the document was rather than the more technical KB/MB value.

Approach

To generate the thumbnail image for each document I used the Adobe Acrobat 5.0 SDK and the Microsoft .NET 1.1 Framework.

Note: do not confuse the thumbnails that are part of a PDF document with the .png files this application generates.

The Acrobat SDK combined with the full version of Adobe Acrobat (sadly the free reader does not expose the COM interfaces) exposes a COM library of objects that can be used to manipulate and access PDF information.

So using these COM objects via COM Interop, we can load the PDF document, get the first page and render that page to the clipboard. Then using the .NET Framework we can copy this to a bitmap, scale and combine that image and then save the result as a .gif or .png file.

Original rendered PDF page scaled down

At first I just saved the scaled down image, but then decided to "fancy" up the thumbnail with a drop-shadow and folded corner. To achieve this effect I created a transparent .gif, called pdftemplate_portrait.gif, using Macromedia Fireworks MX where the main body of the page template was transparent.

By making the bottom-left pixel transparent too we can easily set the transparent colour for a bitmap in .NET.

I keep the top-right of the image white where the corner folds over. That means I can just combine the images by drawing the transparent template directly over the PDF image to achieve the final look.

Compositing the template and rendered image together

Pre-requisites

The full version of Adobe Acrobat (the free reader does not expose the COM interfaces) which exposes a COM library of objects to manipulate and access PDF information.

The Adobe Acrobat 5.0 SDK which is a free download from the Adobe Solutions Network website (note: the site requires registration). The latest SDK for Acrobat 6.0 requires paid membership, so we will use the previous SDK version.

Link on Adobe website for the Acrobat 5.0 SDK

To quickly see if you have the full version of Adobe Acrobat installed, use regedit.exe and look under HKEY_CLASSES_ROOT for an entry called AcroExch.PDDoc.

Check for AcroExch.PDDoc
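
The same registry check can also be done from code before the batch run starts. The following is only a minimal sketch, not part of the downloadable project: IsProgIdRegistered is a hypothetical helper name, and it only tests that a key with the given ProgID exists under HKEY_CLASSES_ROOT, which does not guarantee that Acrobat itself will start correctly.

```vbnet
Imports System
Imports Microsoft.Win32

Module AcrobatCheck
    ' Hypothetical helper (not from the article's download):
    ' returns True when a key named progId exists under HKEY_CLASSES_ROOT.
    Function IsProgIdRegistered(ByVal progId As String) As Boolean
        ' OpenSubKey returns Nothing when the key does not exist
        Dim key As RegistryKey = Registry.ClassesRoot.OpenSubKey(progId)
        If key Is Nothing Then
            Return False
        End If
        key.Close()
        Return True
    End Function
End Module
```

Calling IsProgIdRegistered("AcroExch.PDDoc") at start-up and logging a clear message when it returns False is friendlier than letting CreateObject fail later on.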

You'll also need the .NET 1.1 Framework and some PDF files to test the solution.

The code was written in VB.NET using the .NET 1.1 Framework and Visual Studio.NET 2003 on Windows XP, but there is no reason it wouldn't work on Windows NT/2000 or .NET 1.0.

Using the code

The code is quite simple with a try/catch over the main body. It is purposely in one large block so it's easy to see what is happening and to step through and examine with the debugger.

Initially we create an instance of AcroExch.PDDoc using late-binding. The referenced Adobe Acrobat 5.0 Type Library (Acrobat.tlb from C:\Program Files\Adobe\Acrobat 5.0 SDK\InterAppCommunicationSupport\Headers) does not expose a COM class you can create using early-binding. By referencing the type library we can get the Intellisense and strong-typing of the other Acrobat objects.

Pass the filename of the PDF document to be opened to the PDDoc object, which can then be queried for metadata on the document: GetNumPages() for the page count and GetInfo() for custom document properties.

' Create the document (can only create the AcroExch.PDDoc object using
' late-binding)
pdfDoc = CreateObject("AcroExch.PDDoc")

' Open the document
ret = pdfDoc.Open(inputFile)

If ret = False Then
    Throw New FileNotFoundException
End If

' Get the number of pages
pageCount = pdfDoc.GetNumPages()

Set a reference to the first page of the document as pdfPage, which is of type Acrobat.CAcroPDPage. From this we can get a rectangle object of the actual page dimensions. One strange point to notice here is that the Adobe Acrobat SDK documentation seems incorrect, as the PDFRect that is returned from the GetSize() method has IDispatch properties x, y but the PDFRect we need to supply to CopyToClipboard must have left, right, top, bottom.

Finally we render the PDF page to the clipboard at full size. We could have Acrobat scale the image down for us by a percentage, but we can get better visual results using the .NET scaling algorithms of the Bitmap class.

It would have been more efficient to render directly to an off-screen bitmap, and also not to have overwritten whatever was previously on the clipboard, but I found the clipboard method the most stable way to get a rendered bitmap of the page using Acrobat.

Although it looks like the pdfPage object has a DrawEx method that can take an HDC, I couldn't get the method to work in a consistently successful way. Calling DrawEx in the paint event of a Windows Forms application did work but it still wouldn't write to an off-screen bitmap directly. Therefore the clipboard method is used and if the process runs on a batch server it won't cause too much worry.

Note: the Draw method is deprecated, as it only works on Win16 systems where hWnd was unique to Windows and not to each process as on NT.

' Get the first page
pdfPage = pdfDoc.AcquirePage(0)

' Get the size of the page.
' This is a really strange bug/documentation problem:
' the PDFRect you get back from GetSize has properties
' x and y, but the PDFRect you have to supply to CopyToClipboard
' has left, right, top, bottom
pdfRectTemp = pdfPage.GetSize

' Create PDFRect to hold dimensions of the page
pdfRect = CreateObject("AcroExch.Rect")

pdfRect.Left = 0
pdfRect.Right = pdfRectTemp.x
pdfRect.Top = 0
pdfRect.Bottom = pdfRectTemp.y

' Render to clipboard, scaled by 100 percent (i.e. original size).
' Even though we want a smaller image, it is better for us to scale in .NET
' than in Acrobat, as Acrobat would greek out small text
' see http://www.adobe.com/support/techdocs/1dd72.htm
Call pdfPage.CopyToClipboard(pdfRect, 0, 0, 100)

Dim clipboardData As IDataObject = Clipboard.GetDataObject()

Grab the rendered page bitmap from the clipboard and, based on the pdfRectTemp object, determine if it's a portrait or landscape document. Set the correct file to load as the template, and if it is landscape, switch the width and height.

Dim pdfBitmap As Bitmap = clipboardData.GetData(DataFormats.Bitmap)

' Size of generated thumbnail in pixels
Dim thumbnailWidth As Integer = 38
Dim thumbnailHeight As Integer = 52
Dim templateFile As String

' Switch between portrait and landscape
If (pdfRectTemp.x < pdfRectTemp.y) Then
   templateFile = templatePortraitFile
Else
   templateFile = templateLandscapeFile

   ' Swap width and height (little trick not using third temp variable)
   thumbnailWidth = thumbnailWidth Xor thumbnailHeight
   thumbnailHeight = thumbnailWidth Xor thumbnailHeight
   thumbnailWidth = thumbnailWidth Xor thumbnailHeight
End If
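
The orientation test above can also be pulled out into a small function, which avoids the Xor swap and is easier to unit test. This is only a sketch of an alternative, not the code from the download: GetThumbnailSize is a hypothetical helper name, and it hard-codes the same 38 x 52 portrait size and the same rule as the article (a page is portrait only when its width is less than its height).

```vbnet
Imports System
Imports System.Drawing

Module ThumbnailSizing
    ' Hypothetical helper (not from the article's download):
    ' returns the thumbnail size in pixels for a page of the given dimensions,
    ' using the portrait size for tall pages and the swapped size otherwise.
    Function GetThumbnailSize(ByVal pageWidth As Integer, ByVal pageHeight As Integer) As Size
        Const PortraitWidth As Integer = 38
        Const PortraitHeight As Integer = 52

        If pageWidth < pageHeight Then
            Return New Size(PortraitWidth, PortraitHeight)
        End If

        ' Landscape (or square) page: width and height swap places
        Return New Size(PortraitHeight, PortraitWidth)
    End Function
End Module
```

With this helper the calling code would pass pdfRectTemp.x and pdfRectTemp.y and read Width and Height from the result, while the choice of template file stays in the If block.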

Load the template file as a Bitmap and as an Image. We use both because the Bitmap class supports MakeTransparent and the image can easily be passed to the Graphics.DrawImage() method. It is slightly inefficient but speed isn't the primary objective for this application.

Render the pdfImage using the GetThumbnailImage() method of the .NET Framework Bitmap class, which provides a very smooth scaled version of the image.

Next create a blank bitmap with room for the template border. Set the templateBitmap to use the bottom-left pixel of the image as the transparency colour by calling MakeTransparent(). See an article on Chris Sells' website for more on transparencies in .NET.

Using the new blank bitmap, draw the rendered PDF page image to it and then the template with transparency directly over the top. Because the template is transparent, the page image will still show through its main area.

Finally, save the composited image back as a .png or .gif file, although .png does look better.

' Load the template graphic
Dim templateBitmap As Bitmap = New Bitmap(templateFile)
Dim templateImage As Image = Image.FromFile(templateFile)

' Render to small image using the bitmap class
Dim pdfImage As Image = pdfBitmap.GetThumbnailImage(thumbnailWidth, _
  thumbnailHeight, _
  Nothing, Nothing)

' Create new blank bitmap (+ 7 for template border)
Dim thumbnailBitmap As Bitmap = New Bitmap(thumbnailWidth + 7, _
  thumbnailHeight + 7, _
  Imaging.PixelFormat.Format32bppArgb)

' To overlay the template on the image, we need to set the transparency
' http://www.sellsbrothers.com/writing/default.aspx?content=dotnetimagerecoloring.htm
templateBitmap.MakeTransparent()

Dim thumbnailGraphics As Graphics = Graphics.FromImage(thumbnailBitmap)

' Draw rendered pdf image to new blank bitmap
thumbnailGraphics.DrawImage(pdfImage, 2, 2, thumbnailWidth, thumbnailHeight)

' Draw template outline over the bitmap (pdf will show through the
' transparent area)
thumbnailGraphics.DrawImage(templateImage, 0, 0)

' Save as .png file
thumbnailBitmap.Save(outputFile, Imaging.ImageFormat.Png)
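
The code above calls MakeTransparent() with no arguments. As a variation, and only as a sketch rather than the author's code, the key colour can be made explicit by reading the bottom-left pixel and passing it to the MakeTransparent(Color) overload. MakeBottomLeftTransparent is a hypothetical helper name, and the sketch assumes the bottom-left pixel of the template holds an opaque key colour.

```vbnet
Imports System
Imports System.Drawing

Module TemplateTransparency
    ' Hypothetical helper (not from the article's download):
    ' makes every pixel that matches the bottom-left pixel's colour transparent.
    Sub MakeBottomLeftTransparent(ByVal template As Bitmap)
        ' Pixel rows run top to bottom, so the bottom-left pixel is (0, Height - 1)
        Dim keyColour As Color = template.GetPixel(0, template.Height - 1)
        template.MakeTransparent(keyColour)
    End Sub
End Module
```

Because a Bitmap is itself an Image, a bitmap treated this way can be handed straight to Graphics.DrawImage(), so a second copy loaded with Image.FromFile() is not strictly required for this variation.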

Write some feedback to the console as we work through each of the files.

Then explicitly release the references to the COM objects, as Acrobat isn't the best suited application for opening and closing multiple PDF documents without falling over. Luckily the code doesn't cause Acrobat to display any UI that might cause the process to hang waiting for user interaction.

Console.WriteLine("Generated thumbnail... {0}", outputFile)

thumbnailGraphics.Dispose()

pdfDoc.Close()
Marshal.ReleaseComObject(pdfPage)
Marshal.ReleaseComObject(pdfRect)
Marshal.ReleaseComObject(pdfDoc)
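
If an exception is thrown part-way through, the release calls above are skipped unless they sit in a Finally block, and Marshal.ReleaseComObject itself throws when handed Nothing. One possible way to guard against both, sketched here with a hypothetical helper named ReleaseCom that is not part of the download:

```vbnet
Imports System
Imports System.Runtime.InteropServices

Module ComCleanup
    ' Hypothetical helper (not from the article's download):
    ' releases a COM wrapper if there is one, and quietly ignores Nothing
    ' or ordinary .NET objects so it is safe to call during clean-up.
    Sub ReleaseCom(ByVal comObject As Object)
        If Not (comObject Is Nothing) AndAlso Marshal.IsComObject(comObject) Then
            Marshal.ReleaseComObject(comObject)
        End If
    End Sub

    ' Usage pattern (pdfDoc, pdfPage and pdfRect as in the article's code):
    '   Try
    '       ... open, render and save ...
    '   Finally
    '       ReleaseCom(pdfPage)
    '       ReleaseCom(pdfRect)
    '       ReleaseCom(pdfDoc)
    '   End Try
End Module
```

The null check comes first because Marshal.IsComObject also rejects Nothing; AndAlso short-circuits so the second test only runs on a real object.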

Visual Studio.NET Solution

The project you can download has all the VB.NET code and the COM Interop DLL that was generated. Even though the application is actually a console application we still need a reference to System.Windows.Forms, as the Clipboard and DataFormats classes come from there.

Use the app.config to set the input and output paths for the .pdf files and .png files respectively. By default it reads from and writes to C:\thumbnails\.

Output

Running the PDFThumbnail.exe console application will enumerate all the .pdf files in the directory specified in the .config file writing out a .png image of the first page.

The result can be seen in the screenshot below.

Explorer view
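
The enumeration step described above might look roughly like the following. This is a sketch under assumptions, not the code from the download: GetThumbnailPath and ListThumbnailTargets are hypothetical names, and the two directory arguments stand in for whatever values the application reads from its .config file.

```vbnet
Imports System
Imports System.IO

Module ThumbnailPaths
    ' Hypothetical helper (not from the article's download):
    ' maps an input .pdf path to the .png path it would produce in outputDir.
    Function GetThumbnailPath(ByVal pdfFile As String, ByVal outputDir As String) As String
        Dim name As String = Path.GetFileNameWithoutExtension(pdfFile) & ".png"
        Return Path.Combine(outputDir, name)
    End Function

    ' Lists every .pdf in inputDir together with its target thumbnail file.
    ' In the real application each pair would be handed to the rendering code.
    Sub ListThumbnailTargets(ByVal inputDir As String, ByVal outputDir As String)
        For Each pdfFile As String In Directory.GetFiles(inputDir, "*.pdf")
            Console.WriteLine("{0} -> {1}", pdfFile, GetThumbnailPath(pdfFile, outputDir))
        Next
    End Sub
End Module
```

Keeping the path mapping in its own function makes it easy to check the naming rule without having Acrobat or any PDF files available.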

Further Enhancements

Further improvements might be to:

Render directly to an off-screen bitmap rather than to the clipboard.
Remove the reliance on having a full version of Adobe Acrobat by using Ghostscript libraries instead.

One case we had was documents that could be viewed internally but were blocked for external users due to compliance issues. By designing different templates and rendering them with the page, it was obvious that the document was private, further enhancing usability.

Points of Interest

The Adobe Acrobat 5.0 SDK documentation is not the best written, but most information is there if you dig a little.

If running under an NT service account the screen resolution and colour depth make a difference. For example, if your server is only set for 256 colours at 640 x 480 and the console application is run via the service, it will not be able to render 24-bit colour thumbnails. I've seen the same effect when using charting controls from ASP, where the production IIS servers had low screen resolutions set and the colour-depth of the charts was low.

Also, if running in a batch on a server you should check the terms of the Acrobat license agreement to see whether you are allowed to run the Adobe Acrobat application in a server-type process.

The images are about 2-3KB in size, and for about 3GB of documents the thumbnails would take an additional 60MB, so storage requirements are not excessive. The actual time to generate thumbnails for thousands of documents would be a few hours, as Acrobat needs to load each document and render it to the clipboard before the .NET bitmap scaling and compositing can run.

References

Microsoft .NET Framework 1.1 documentation
Chris Sells' web site for the transparency example code
Adobe Acrobat 5.0 SDK documentation and examples
Code Complete Second Edition for the example PDF document (which I hope Steve doesn't mind me including and which I can totally recommend even nearly ten years since it was first published)

Conclusion

This article has shown how to manipulate PDF documents using the Acrobat SDK and combine images using the .NET framework.

At first it can be quite daunting trying to find good information on working with PDF documents programmatically, although there are now a number of good commercial components which hide a lot of the underlying PostScript complexities.

I originally wrote this utility in Visual Basic 6 using third-party imaging components, but now it is easier to share the code using the .NET Framework, especially as the complex imaging and manipulation can now be done with a few simple statements.

Thanks and I hope you enjoyed reading this article; I'd be interested to hear if people found it useful.

History

19th January 2004 - Initial release to the Code Project.
12th May 2004 - Added C# version

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.