Converting Scanned Document Images to Searchable PDFs with OCR

14 Dec 2006
Demonstrates the use of Atalasoft's DotImage GlyphReader OCR to enable .NET applications to digitize paper documents as searchable PDFs that can be indexed by search engines.

This article is in the Product Showcase section for our sponsors at CodeProject. These articles are intended to provide you with information on products and services that we consider useful and of value to developers.

Introduction

From health records, tax forms, and insurance claims to old memos, magazines, and books, businesses are digitizing paper every day. With the advent of better search technology, having searchable text for all these documents is an obvious win. The common approach is to use OCR (Optical Character Recognition) to translate the images into a document format that indexers already understand. The drawback is that we often lose the layout, images, and color of the original. And since no OCR is perfect, we still need the original image to fix mistakes. What we want is a document format that looks like the original image to a human reader but like plain text to an indexer. And when we copy from the image, we want text placed on the clipboard. This is the promise of the searchable PDF.

In a searchable PDF, the original scanned image is retained so any human can read the document. The text extracted via OCR is placed behind the image, so search indexers can see it and Acrobat Reader lets us select it as text. The ubiquity of desktop and enterprise search, ever-increasing OCR accuracy, and mass adoption of PDF are a powerful combination that makes the searchable PDF the ideal format for storing digitized paper.

This article demonstrates how simple it is to develop an application that generates searchable PDFs from scanned documents. The resulting files can be indexed by Google, SharePoint, Microsoft desktop search, and any other application that indexes PDF documents.

To help build this application, Atalasoft publishes an OCR framework that simplifies working with industry-leading OCR engines and our own highly accurate engine, GlyphReader. A free 30-day evaluation of the Atalasoft DotImage Document Imaging SDK, including the OCR module, GlyphReader, and all other add-ons, can be downloaded from atalasoft.com.

Using our framework, these steps are handled for you:

  1. Decompress the image.
  2. Pre-process the image to make OCR more accurate (including cleaning it or deskewing it).
  3. OCR the image to extract the text.
  4. Re-encode the image in a choice of formats, including CCITT Group 4, JBIG2, JPEG, or JPEG 2000, to keep the file as small as possible.
  5. Construct a PDF with the image and the extracted text, with each word accurately positioned behind the appropriate place in the image.
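
Step 5 is the least obvious of these, so here is a rough picture of what positioning text behind an image can look like at the PDF level. This is a minimal sketch, not Atalasoft's implementation: the article does not say how PdfTranslator builds its pages. The WordBox type, the HiddenTextSketch class, and the /Im0 and /F1 resource names are all hypothetical. The sketch assumes OCR word boxes are given in pixels with a top-left origin, converts them to PDF points (1/72 inch, bottom-left origin), and writes each word in text rendering mode 3 (invisible) before painting the scan over it.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical type for illustration; it does not come from DotImage.
struct WordBox
{
    public string Text;
    public int X, Y, Width, Height;   // pixels, origin at the top-left of the scan

    public WordBox(string text, int x, int y, int width, int height)
    {
        Text = text; X = x; Y = y; Width = width; Height = height;
    }
}

static class HiddenTextSketch
{
    // Pixels -> PDF points (1 point = 1/72 inch).
    static double ToPt(double px, double dpi) { return px * 72.0 / dpi; }

    // Culture-invariant number formatting, so decimals always use '.'.
    static string Num(double v) { return v.ToString("0.##", CultureInfo.InvariantCulture); }

    // Escape the characters that are special inside a PDF literal string.
    static string Escape(string s)
    {
        return s.Replace("\\", "\\\\").Replace("(", "\\(").Replace(")", "\\)");
    }

    // Builds a page content stream: invisible words first, then the scan on top.
    public static string BuildContent(int widthPx, int heightPx, double dpi, WordBox[] words)
    {
        double pageW = ToPt(widthPx, dpi);
        double pageH = ToPt(heightPx, dpi);
        StringBuilder sb = new StringBuilder();

        sb.Append("BT 3 Tr\n");   // begin text, rendering mode 3 = invisible
        foreach (WordBox w in words)
        {
            double x = ToPt(w.X, dpi);
            // Flip Y for PDF's bottom-left origin; the bottom of the box
            // stands in for the text baseline (an approximation).
            double y = pageH - ToPt(w.Y + w.Height, dpi);
            double fontSize = ToPt(w.Height, dpi);   // also an approximation
            sb.Append("/F1 ").Append(Num(fontSize)).Append(" Tf 1 0 0 1 ")
              .Append(Num(x)).Append(' ').Append(Num(y)).Append(" Tm (")
              .Append(Escape(w.Text)).Append(") Tj\n");
        }
        sb.Append("ET\n");

        // Paint the image XObject scaled to cover the whole page.
        sb.Append("q ").Append(Num(pageW)).Append(" 0 0 ").Append(Num(pageH))
          .Append(" 0 0 cm /Im0 Do Q\n");
        return sb.ToString();
    }
}
```

For a letter-size page scanned at 300 dpi (2550 x 3300 pixels), a word box at pixel (300, 300) that is 50 pixels tall lands 72 points from the left edge, with its baseline 708 points above the bottom of the page.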

Atalasoft's OCR framework includes a flexible Translator interface for producing output from the recognition process. For example, TextTranslator is available out of the box and generates a text stream. The Searchable PDF Module includes the PdfTranslator, which generates either text-only PDFs or image-with-hidden-text PDFs. Both are "searchable", but the latter includes the original image and is what we are going to use.
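
To make the translator idea concrete, here is a toy model of an engine that picks an output translator by MIME type. It is only an illustration of the pattern: IToyTranslator, PlainTextTranslator, and ToyEngine are hypothetical names, not DotImage types, and the real engine's behavior for an unregistered MIME type is not described in this article (the sketch assumes it would be an error).

```csharp
using System;
using System.Collections.Generic;

// Toy model only: these types are hypothetical and are not the DotImage API.
interface IToyTranslator
{
    string MimeType { get; }                  // output format this translator produces
    string Render(IList<string> pageTexts);   // turn recognized page text into output
}

class PlainTextTranslator : IToyTranslator
{
    public string MimeType { get { return "text/plain"; } }

    public string Render(IList<string> pageTexts)
    {
        // Separate pages with a blank line.
        return string.Join("\n\n", pageTexts);
    }
}

class ToyEngine
{
    private readonly List<IToyTranslator> translators = new List<IToyTranslator>();

    // Add-on translators must be registered here before they can be used.
    public List<IToyTranslator> Translators { get { return translators; } }

    public string Translate(IList<string> pageTexts, string mimeType)
    {
        foreach (IToyTranslator t in translators)
        {
            if (string.Equals(t.MimeType, mimeType, StringComparison.OrdinalIgnoreCase))
                return t.Render(pageTexts);
        }
        // Assumption: an unknown MIME type is treated as an error.
        throw new NotSupportedException("No translator registered for " + mimeType);
    }
}
```

The point of the pattern is that the caller only names the output format it wants. Supporting a new format means registering one more translator, with no change to the recognition code.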

This article uses the following 2-page color TIFF as the source document to OCR. Shown here are lower-resolution images of the original scanned TIFF (a recent white paper from Atalasoft that was printed and then scanned in color).

Extracting the Text into a Text File

Let's start with a method that simply extracts the text into a file. First, we create an ImageSource object to efficiently handle multi-page image files. Then we create the OCR engine, initialize it, translate the images to the desired MIME type, and shut down the engine.

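void MakeText(string inFile, string outFile)
{
    // Image source that efficiently handles multi-page image files.
    using (FileSystemImageSource fis = 
           new FileSystemImageSource(new string[1] { inFile }, true))
    {
        GlyphReaderEngine ocr = new GlyphReaderEngine();
        ocr.Initialize();
        try
        {
            // Recognize every page and write the text to outFile.
            ocr.Translate(fis, "text/plain", outFile);
        }
        finally
        {
            // Shut the engine down even if translation throws.
            ocr.ShutDown();
        }
    }
}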

The resulting text file obviously does not look at all like the original document, but it does contain the text. It also isn't stored in the same file as the image. We can do better.

Creating the Searchable PDF

For the next code sample, we'll use a PdfTranslator to create a searchable PDF. To do this we need to:

  1. Create an instance of the PdfTranslator
  2. Set its OutputType to TextUnderImage (to create a searchable PDF)
  3. Add it to the OcrEngine's Translators collection (since it's an add-on, it doesn't come pre-registered)
  4. Use the engine to translate with the output MIME type set to "application/pdf"

Here's the code:

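void MakePdf(string inFile, string outFile)
{
    // Image source that efficiently handles multi-page image files.
    using (FileSystemImageSource fis = 
           new FileSystemImageSource(new string[1] { inFile }, true))
    {
        GlyphReaderEngine ocr = new GlyphReaderEngine();

        // Keep the scanned image visible and place the recognized text under it.
        PdfTranslator pdfTrans = new PdfTranslator();
        pdfTrans.OutputType = PdfTranslatorOutputType.TextUnderImage;

        // The PDF translator is an add-on, so register it with the engine.
        ocr.Translators.Add(pdfTrans);
        ocr.Initialize();
        try
        {
            ocr.Translate(fis, "application/pdf", outFile);
        }
        finally
        {
            // Shut the engine down even if translation throws.
            ocr.ShutDown();
        }
    }
}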

The result is a high-quality searchable PDF! When the PDF is opened in Acrobat Reader (see screenshot below), all text in the document can be selected as real text, even though the visible part of this PDF is the actual color raster image.

The OCR Engine and PDF Translator handle all the details required to deskew the image, store it, produce accurate OCR, compress the image, accurately place the recognized text under the right part of the image, and generate the PDF document.

Simply having this file on your filesystem is enough for Google Desktop Search or Windows Desktop Search to index the document properly, while the document still looks exactly like the original.

Product Requirements

To add searchable PDF generation to your applications, you will need the following products from Atalasoft:

  • DotImage Document Imaging SDK
  • OCR GlyphReader Engine Module (runtimes are additional)
  • OCR Searchable PDF Module (includes 20 runtimes)

Everything is included in the DotImage SDK, which you can download and evaluate free for 30 days. Be sure to request Evaluation Licenses for the required products. Attached to this article are the resulting PDF and the C# 2.0 source code for a simple console application whose first argument is the input image file and whose second argument is the resulting searchable PDF file.
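
The attached source is not reproduced here, but the command-line handling for such a console application might look roughly like the following. This is a sketch under assumptions: SearchablePdfCli, its Run method, and the exit codes are hypothetical, and the conversion is passed in as a delegate so the block compiles without the SDK installed.

```csharp
using System;
using System.IO;

static class SearchablePdfCli
{
    // Hypothetical entry-point logic for the two-argument console application.
    // Returns a process exit code: 0 = success, 1 = bad usage, 2 = input not found.
    public static int Run(string[] args, Action<string, string> convert)
    {
        if (args == null || args.Length != 2)
        {
            Console.Error.WriteLine("Usage: <input image file> <output PDF file>");
            return 1;
        }

        string inFile = args[0];    // first argument: the scanned image
        string outFile = args[1];   // second argument: the searchable PDF to write

        if (!File.Exists(inFile))
        {
            Console.Error.WriteLine("Input file not found: " + inFile);
            return 2;
        }

        // The caller supplies the actual conversion routine.
        convert(inFile, outFile);
        return 0;
    }
}
```

In a real program, Main would pass the MakePdf method shown earlier as the convert delegate and return the result of Run as the exit code.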

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
