Converting Images to Text using Office 2007 OCR, OpenXML and Speech Recognition

Daniel Campos

19 Feb 2009, CPOL, 4 min read

This article shows how to integrate the Office 2007 OCR engine with custom applications, and how to use OpenXML and Speech Recognition alongside it.

Download source - 45.16 KB

Introduction

Sometimes, while developing an application, we have a scanned document (an image) that we want to convert to text (a Word 2007 document). Some scanners ship with applications that perform this kind of conversion automatically, but most of the time the generated document is a *.pdf, an *.odt or some other format. If you want to convert directly to *.docx (OpenXML) documents, you have to use third-party applications or develop the conversion from scratch.

OpenXML became an ISO standard (ISO/IEC 29500) and its adoption is growing day by day, driven by its performance, scalability and security. It is the default format of Microsoft Office 2007 documents (*.pptx, *.docx, *.xlsx). Files can be up to 75 percent smaller than the equivalent binary documents, and the format is based on two major technologies: ZIP and XML.

Speech support is provided by the System.Speech API, which has been available since .NET Framework 3.0 and is included in .NET Framework 3.5. Developers can use this API to provide a better user experience, easier access to specific information and so on. Speech recognition is also a default feature of Windows Vista.

Scenario

To make developers' work easier and to avoid integration with third-party applications, Microsoft ships an OCR (Optical Character Recognition) API with Office 2007 called MODI (Microsoft Office Document Imaging). It's important to remember that the API used in this sample is specific to Office 2007 (Office 2003 has its own OCR API).

In this article, we'll create a Windows application that uses the Office 2007 OCR API to generate OpenXML documents. In addition, we'll use the speech API to improve the application's user experience.

Before we start, make sure the following are installed:

Visual Studio 2008
.NET Framework 3.5
OpenXML SDK 1.0
Office 2007

You also need the Microsoft Office Document Imaging 12.0 Type Library. The Office 2007 setup doesn't install this component by default, so you may have to add it afterwards. To do this:

Run the Office 2007 installation setup
Click on the button Add or Remove Features
Make sure that the component is installed

Using the MODI

To use the Office 2007 OCR API, you have to add a reference to Microsoft Office Document Imaging 12.0 Type Library. To do this:

In Solution Explorer, right-click the project and select Add Reference
On the COM tab, select Microsoft Office Document Imaging 12.0 Type Library

Declare a MODI document field in the form class:

/// <summary>
/// Document Imaging Library
/// </summary>
MODI.Document md;

In the Form class constructor, instantiate the MODI object:

public Form1()
{
    InitializeComponent();
    speaker.Rate = -2;
    speaker.Volume = 100;
    ListFiles = new List<string>();
    md = new MODI.Document();
}

After that, you just have to implement the conversion method. Let's see how to do this:

private void OCRImplementation()
{
    // Keep the wait cursor for the whole batch, not just the first file.
    Cursor = Cursors.WaitCursor;
    try
    {
        foreach (string Name in checkedListBox1.CheckedItems)
        {
            try
            {
                // Load the image file and run OCR on it.
                md.Create(Name);
                md.OCR(MODI.MiLANGUAGES.miLANG_ENGLISH, true, true);

                // Join the recognized words of the first image with single spaces.
                string strText = String.Empty;
                MODI.Image image = (MODI.Image)md.Images[0];
                MODI.Layout layout = image.Layout;
                for (int i = 0; i < layout.Words.Count; i++)
                {
                    MODI.Word word = (MODI.Word)layout.Words[i];
                    if (strText.Length > 0)
                    {
                        strText += " ";
                    }
                    strText += word.Text;
                }

                // Close without saving the OCR results back to the image file.
                md.Close(false);
                CreateDocument(strText);
            }
            catch (Exception ex)
            {
                MessageBox.Show(ex.Message);
            }
        }
    }
    finally
    {
        Cursor = Cursors.Default;
    }
}

The method OCRImplementation converts each checked image file (MODI accepts formats such as *.tif, *.jpg, *.gif and *.bmp; this sample uses a TIFF file). The Create method of the md object receives the path of the file to be converted. The OCR method receives three parameters: the first is the language of the document, the second specifies whether the OCR engine attempts to determine the orientation of the page, and the third specifies whether the OCR engine attempts to fix small angles of misalignment from the vertical.

To retrieve the text, take an Image object from the document's Images collection and read its Layout property. The Layout object gives access to the recognized text: its Words collection has a Count property that allows iteration over the list of words, and each word is retrieved by index. Because the words come back without separators, the loop adds a blank space between them.

The method Close of the md object takes a boolean argument indicating whether to save changes to the image file.
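
The word-joining step can also be sketched on its own, separate from MODI. The helper below is hypothetical (JoinWords is not part of the article's download) and takes plain strings as a stand-in for the word.Text values, so it runs without Office installed. It uses a StringBuilder instead of repeated string concatenation, which may matter for pages with many words.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class WordJoinSketch
{
    // Hypothetical helper: joins recognized words with single spaces,
    // mirroring the loop inside OCRImplementation.
    public static string JoinWords(IEnumerable<string> words)
    {
        StringBuilder text = new StringBuilder();
        foreach (string word in words)
        {
            if (text.Length > 0)
            {
                text.Append(' ');
            }
            text.Append(word);
        }
        return text.ToString();
    }

    static void Main()
    {
        // Stand-in for layout.Words; real code would read word.Text from MODI.
        string[] recognized = { "Converting", "images", "to", "text" };
        Console.WriteLine(JoinWords(recognized)); // prints "Converting images to text"
    }
}
```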

Using OpenXML SDK

In Solution Explorer, add a reference to the DocumentFormat.OpenXml library. This library lets us save the converted text as a Word document. A string constant holds the template for the main document part: it defines the markup (in this case WordprocessingML) and contains a #REPLACE# placeholder where the text will go.

private const string PART_TEMPLATE =
    "<?xml version='1.0' encoding='UTF-8' standalone='yes'?>" +
    "<w:document xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>" +
    "<w:body><w:p><w:r><w:t>#REPLACE#</w:t></w:r></w:p></w:body></w:document>";

The method CreateDocument is responsible for inserting the text inside the document structure.

private void CreateDocument(string Text)
{
    // Dispose the package even if writing fails, so the file is not left locked.
    using (WordprocessingDocument wordDoc =
        WordprocessingDocument.Create(txt_SavePath.Text,
            WordprocessingDocumentType.Document))
    {
        MainDocumentPart docPart = wordDoc.AddMainDocumentPart();

        // Escape XML special characters (such as & and <) that the OCR text
        // may contain; unescaped, they would make the document part invalid.
        string partML = PART_TEMPLATE.Replace("#REPLACE#",
            System.Security.SecurityElement.Escape(Text));

        // Write the markup into the main document part as UTF-8.
        using (Stream partStream = docPart.GetStream())
        {
            byte[] buffer = new UTF8Encoding().GetBytes(partML);
            partStream.Write(buffer, 0, buffer.Length);
        }
    }
}
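
As an alternative to string replacement, the same main-part markup could be built with LINQ to XML (System.Xml.Linq, available in .NET Framework 3.5), which escapes characters such as & and < automatically. This is only a sketch: BuildDocumentXml and the class name are hypothetical and not part of the article's download. The namespace URI is the one used in PART_TEMPLATE.

```csharp
using System;
using System.Xml.Linq;

static class DocumentXmlSketch
{
    // WordprocessingML main namespace, the same URI used in PART_TEMPLATE.
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical helper: wraps the text in document/body/p/r/t elements.
    public static string BuildDocumentXml(string text)
    {
        XDocument doc = new XDocument(
            new XDeclaration("1.0", "UTF-8", "yes"),
            new XElement(W + "document",
                new XAttribute(XNamespace.Xmlns + "w", W.NamespaceName),
                new XElement(W + "body",
                    new XElement(W + "p",
                        new XElement(W + "r",
                            new XElement(W + "t", text))))));

        // XDocument.ToString() omits the XML declaration, so prepend it.
        return doc.Declaration + doc.ToString(SaveOptions.DisableFormatting);
    }

    static void Main()
    {
        // The ampersand and angle brackets are escaped in the output.
        Console.WriteLine(BuildDocumentXml("Tom & Jerry <1940>"));
    }
}
```

The returned string could then be written to the main document part's stream in the same way CreateDocument writes partML.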

Speech Recognition

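The sample uses the speech synthesis part of the API to give spoken feedback. Add a reference to System.Speech on the .NET tab of the Add Reference dialog, then declare a SpeechSynthesizer (from the System.Speech.Synthesis namespace) as a field of the form:

/// <summary>
/// Speech synthesizer
/// </summary>
SpeechSynthesizer speaker = new SpeechSynthesizer();

After that, you just have to adjust the Volume and Rate properties (as done in the form constructor) and call the Speak method to speak a string: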

speaker.Speak("Searching");
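
Put together as a standalone console program, the synthesizer setup might look like the sketch below. It assumes a machine with an installed voice and an audio output device, and the class name is only illustrative. The Rate and Volume values are the ones set in the form constructor.

```csharp
using System;
using System.Speech.Synthesis; // requires a reference to System.Speech.dll

static class SpeechSketch
{
    static void Main()
    {
        using (SpeechSynthesizer speaker = new SpeechSynthesizer())
        {
            // Send the audio to the default playback device.
            speaker.SetOutputToDefaultAudioDevice();

            speaker.Rate = -2;    // valid range is -10 (slowest) to 10 (fastest)
            speaker.Volume = 100; // valid range is 0 to 100

            // Speak blocks until the phrase has been spoken.
            speaker.Speak("Searching");
        }
    }
}
```

In a Windows Forms application, SpeakAsync may be a better fit than Speak, since it returns immediately and does not block the UI thread while the phrase is spoken.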

Conclusion

Combining these APIs is an interesting idea: the OCR code is very short compared with third-party APIs. MODI is a tool that can be explored in many ways, and integrating it with OpenXML and the speech API can improve your applications.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)