How To: Use Office 2007 OCR Using C#

Waleed Elkot

24 Aug 2009 | CPOL | 3 min read

Reading text from any image using Microsoft Office 2007 OCR

Download sample - 34.8 KB

Introduction

The sample application scans a specified directory for JPG images and reads the text from each one. It automatically saves the text from each image in a text file with the same name as the image. An image that causes an error during OCR is skipped, so one bad file does not stop the run.

If you have Office 2007 installed, the OCR component is available for you to use. The only dependency this adds to your code is Office itself. Requiring Office (2007 or 2003) on the target machine may or may not suit your situation, but if your client can guarantee that every machine your code runs on has Office (2007 or 2003) installed, this solution is a good fit.

What is OCR?

OCR (Optical Character Recognition) is the recognition of printed or written text characters by a computer. This involves photoscanning of the text character-by-character, analysis of the scanned-in image, and then translation of the character image into character codes, such as ASCII, commonly used in data processing.

Put another way, optical character recognition (OCR) translates images of text, such as scanned documents, into actual text characters. Also known as text recognition, OCR makes it possible to edit and reuse text that is normally locked inside scanned images. OCR uses a form of artificial intelligence known as pattern recognition to identify the individual text characters on a page, including punctuation marks, spaces, and ends of lines.

What is Document Imaging?

Document imaging is the process of scanning paper documents and converting them to digital images that are then stored on CD, DVD, or other storage media. With Microsoft Office Document Imaging, you can scan paper documents and convert them to digital images that you can save in either of these formats:

Tagged Image File Format (TIFF): A high-resolution, tag-based graphics format. TIFF is used for the universal interchange of digital graphics.
Microsoft Document Imaging Format (MDI): A high-resolution, tag-based graphics format, based on the Tagged Image File Format (TIFF) used for digital graphics.

You can save these files to your computer’s hard disk, a network server, a CD, or a DVD. Microsoft Office Document Imaging also lets you perform Optical Character Recognition (OCR) either as part of scanning a document or while you work with a TIFF or MDI file. After performing OCR, you can copy the recognized text from a scanned image or a fax into a Microsoft Word document or a file from another Office program.
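Because Document Imaging performs OCR on TIFF or MDI files, one option for images in other formats is to convert them to TIFF first. The following is a minimal sketch using System.Drawing, not part of the downloadable sample; the class and method names are hypothetical, and the sample code in this article passes JPG files to the OCR component directly.

```csharp
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;

public static class TiffConverter
{
    // Hypothetical helper: saves a TIFF copy of the image next to the
    // original and returns the path of the copy.
    // Assumes imagePath is not already a .tif file, because the source
    // file stays open while the copy is written.
    public static string ConvertToTiff(string imagePath)
    {
        string tiffPath = Path.ChangeExtension(imagePath, ".tif");
        using (Image image = Image.FromFile(imagePath))
        {
            image.Save(tiffPath, ImageFormat.Tiff);
        }
        return tiffPath;
    }
}
```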

Weakness

To run an application that uses OCR, you must have the Office OCR component installed on your machine. Without the Office OCR component, your application will not work.
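One way to soften this dependency is to check for the component at startup and show a clear message if it is missing. The sketch below assumes that the component registers the COM ProgID "MODI.Document" (the name used when the object is created from script); the class and method names are only for illustration.

```csharp
using System;

public static class OcrAvailability
{
    // Assumption: the Office Document Imaging component registers
    // the COM ProgID "MODI.Document" when it is installed.
    private const string ModiProgId = "MODI.Document";

    // Returns true if the ProgID is registered on this machine.
    public static bool IsModiInstalled()
    {
        // GetTypeFromProgID returns null when the ProgID is not registered.
        return Type.GetTypeFromProgID(ModiProgId) != null;
    }
}
```

An application could call OcrAvailability.IsModiInstalled() before processing any images and ask the user to install the imaging component if it returns false.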

Strength

The component comes with Office, so you can use it in your code at no extra cost. It is also easy to use, because Microsoft provides many code samples that show how to work with it.

Namespaces

using System;
using System.Collections;
using System.IO;
using System.Drawing.Imaging;

Using the Code

The name of the COM object that you need to add as a reference is Microsoft Office Document Imaging 12.0 Type Library. By default, Office 2007 doesn't install it, so you'll need to add it with the Office 2007 installation program. Run the installer, select "Add or Remove Features", click the Continue button, and ensure that the imaging component is installed.

The OCR engine always defaults to the user's regional settings for the LangID argument, unless you specify the language explicitly when calling the OCR method; it does not retain the previously used setting. In a mixed-language environment, it is a good practice to specify the LangID argument explicitly in every call to the OCR method.
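A minimal sketch of that practice is shown below. It wraps the OCR call in a helper that takes the language as a required parameter, so no call can fall back to the regional settings by accident. The class and method names are hypothetical; the code assumes the project references the Microsoft Office Document Imaging 12.0 Type Library.

```csharp
using System;

public static class OcrReader
{
    // Hypothetical helper: returns the recognized text of the first page
    // of an image. The language must be supplied by the caller.
    public static string ReadText(string imagePath, MODI.MiLANGUAGES language)
    {
        MODI.Document md = new MODI.Document();
        try
        {
            md.Create(imagePath);
            // LangId is passed explicitly on every call.
            md.OCR(language, true, true);
            MODI.Image image = (MODI.Image)md.Images[0];
            return image.Layout.Text;
        }
        finally
        {
            // Close the document without saving changes.
            md.Close(false);
        }
    }
}
```

For example, OcrReader.ReadText(path, MODI.MiLANGUAGES.miLANG_ENGLISH) reads an English document, and passing MODI.MiLANGUAGES.miLANG_FRENCH instead reads a French one.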

So, create a Windows Application using C#. In the Visual Studio Solution Explorer, right-click References, choose Add Reference, select the COM tab, and then select Microsoft Office Document Imaging 12.0 Type Library.

/// <summary>
/// Checks a directory for JPG images,
/// reads the text from each image,
/// and saves it in a text file with the same name as the image.
/// Images that fail during OCR are skipped.
/// </summary>
/// <param name="directoryPath">Path of the directory to check for images</param>
public void CheckFileType(string directoryPath)
{
    foreach (string filePath in Directory.GetFiles(directoryPath))
    {
        //get file extension
        string fileExtension = Path.GetExtension(filePath);

        //Check for JPG File Format (ignoring case)
        if (!string.Equals(fileExtension, ".jpg",
              StringComparison.OrdinalIgnoreCase))
        {
            continue;
        }

        //text file path: same as the image path, with a .txt extension
        string textFilePath = Path.ChangeExtension(filePath, ".txt");

        MODI.Document md = null;
        try
        {
            //OCR Operations ...
            md = new MODI.Document();
            md.Create(filePath);
            md.OCR(MODI.MiLANGUAGES.miLANG_ENGLISH, true, true);
            MODI.Image image = (MODI.Image)md.Images[0];

            //save the image text in the text file
            //(overwrites an existing text file with the same name)
            File.WriteAllText(textFilePath, image.Layout.Text);
        }
        catch (Exception exc)
        {
            //uncomment the below code to see the expected errors
            //MessageBox.Show(exc.Message,
            //"OCR Exception",
            //MessageBoxButtons.OK, MessageBoxIcon.Information);
        }
        finally
        {
            //close the document without saving changes
            if (md != null)
            {
                md.Close(false);
            }
        }
    }
}
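The method above depends on two small decisions about paths: whether a file counts as a JPG image, and where the text file for an image goes. The sketch below isolates both so they can be checked without Office installed. The class and method names are hypothetical. It uses Path.ChangeExtension because removing the extension with string.Replace would also remove any earlier occurrence of ".jpg" in the path, for example in a folder name.

```csharp
using System;
using System.IO;

public static class OcrPaths
{
    // True for ".jpg" in any letter case (".JPG", ".Jpg", ...).
    public static bool IsJpeg(string filePath)
    {
        return string.Equals(Path.GetExtension(filePath), ".jpg",
                             StringComparison.OrdinalIgnoreCase);
    }

    // Same path as the image, with only the final extension changed to .txt.
    public static string GetTextFilePath(string imagePath)
    {
        return Path.ChangeExtension(imagePath, ".txt");
    }
}
```

With these helpers, a file named page.JPG is accepted, and an image stored as scans.jpg/page.jpg maps to scans.jpg/page.txt, leaving the folder name untouched.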

Points of Interest

I have built a larger sample application for Office OCR, and I'll release it soon.

Remark

Many people use OCR in Internet spiders to extract data from the images they collect.

My Blog

http://waleedelkot.blogspot.com/

History

24-08-2009: Released

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)