Introduction
When developing an application, we sometimes have a scanned document (an image) that we want to convert to text (a Word 2007 document). Some scanners ship with applications that perform this kind of conversion automatically, but most of the time the generated document is a *.pdf, *.odt or similar. If you want to convert directly to *.docx (OpenXML) documents, you have to use a third-party application or develop the conversion yourself.
OpenXML became an ISO standard (ISO/IEC 29500), and its adoption is growing, driven by its performance, scalability and security. It is the default format of Microsoft Office 2007 documents (*.pptx, *.docx, *.xlsx). Files can be up to 75 percent smaller than the comparable binary documents, and the format is based on two major technologies: ZIP and XML.
Speech support is provided by the System.Speech API, which has been part of the .NET Framework since version 3.0 and is therefore included with .NET Framework 3.5; the speech engine itself is a default feature of Windows Vista. Developers can use this API to provide a better user experience, easier access to specific information and so on. This article uses its speech synthesis (text-to-speech) side.
Scenario
To make developers' work easier and to avoid integration with third-party applications, Microsoft released with Office 2007 an OCR (Optical Character Recognition) API called MODI (Microsoft Office Document Imaging). Keep in mind that the API version used in this sample is specific to Office 2007 (Office 2003 has its own OCR API).
In this article, we'll create a Windows application that uses the Office 2007 OCR API to generate OpenXML documents. In addition, we'll use the speech API to improve the application's user experience.
Before we start, make sure the following are installed:
- Visual Studio 2008
- .NET Framework 3.5
- OpenXML SDK 1.0
- Office 2007
You also need the Microsoft Office Document Imaging 12.0 Type Library. The Office 2007 setup doesn't install this component by default, so you may have to add it afterwards. To do this:
- Run the Office 2007 installation setup
- Select Add or Remove Features
- Make sure that the Microsoft Office Document Imaging component is selected for installation
Using the MODI
To use the Office 2007 OCR API, add a reference to the Microsoft Office Document Imaging 12.0 Type Library. To do this:
- In Solution Explorer, right-click the project and select Add Reference
- On the COM tab, select Microsoft Office Document Imaging 12.0 Type Library
Declare a MODI.Document field in the form class:
MODI.Document md;
In the Form class constructor, instantiate the MODI object:
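public Form1()
{
    InitializeComponent();

    // Speech settings: Rate ranges from -10 to 10, Volume from 0 to 100.
    speaker.Rate = -2;
    speaker.Volume = 100;

    // Paths of the image files to convert.
    ListFiles = new List<string>();

    // MODI document object used for the OCR runs.
    md = new MODI.Document();
}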
After that, you just have to implement the conversion method. Let's see how to do this:
private void OCRImplementation()
{
    Cursor = Cursors.WaitCursor;
    try
    {
        foreach (string fileName in checkedListBox1.CheckedItems)
        {
            try
            {
                // Load the image file and run OCR on it.
                md.Create(fileName);
                md.OCR(MODI.MiLANGUAGES.miLANG_ENGLISH, true, true);

                // Collect the recognized words of the first page, separated by spaces.
                string strText = String.Empty;
                MODI.Image image = (MODI.Image)md.Images[0];
                MODI.Layout layout = image.Layout;
                for (int i = 0; i < layout.Words.Count; i++)
                {
                    MODI.Word word = (MODI.Word)layout.Words[i];
                    if (strText.Length > 0)
                    {
                        strText += " ";
                    }
                    strText += word.Text;
                }

                // Close without saving changes to the image file.
                md.Close(false);
                CreateDocument(strText);
            }
            catch (Exception ex)
            {
                MessageBox.Show(ex.Message);
            }
        }
    }
    finally
    {
        // Restore the cursor once, after all checked files have been processed.
        Cursor = Cursors.Default;
    }
}
The OCRImplementation method converts image files (*.tif, *.jpg, *.gif, *.bmp; in this case we're using a TIFF file). The Create method of the md object receives the path of the file to be converted. The OCR method receives three parameters: the first is the language of the document, the second specifies whether the OCR engine attempts to determine the orientation of the page, and the third specifies whether the OCR engine attempts to fix small angles of misalignment from the vertical.
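Since Create expects the path of an image file, it can help to filter the selection before starting OCR. The sketch below is one possible way to do that. The OcrInput class and its method names are hypothetical and not part of the original sample, and the extension list simply mirrors the formats mentioned above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper (not part of the original sample).
public static class OcrInput
{
    // Extensions taken from the formats listed in the article; adjust as needed.
    private static readonly string[] SupportedExtensions =
        { ".tif", ".jpg", ".gif", ".bmp" };

    // Returns true when the file extension matches one of the listed formats.
    public static bool IsSupportedImage(string path)
    {
        if (string.IsNullOrEmpty(path))
        {
            return false;
        }

        string extension = Path.GetExtension(path);
        foreach (string supported in SupportedExtensions)
        {
            if (string.Equals(supported, extension, StringComparison.OrdinalIgnoreCase))
            {
                return true;
            }
        }
        return false;
    }

    // Keeps only the paths that look like supported image files.
    public static List<string> FilterSupported(IEnumerable<string> paths)
    {
        List<string> result = new List<string>();
        foreach (string path in paths)
        {
            if (IsSupportedImage(path))
            {
                result.Add(path);
            }
        }
        return result;
    }
}
```

A call such as OcrInput.IsSupportedImage(fileName) could guard md.Create, so that unsupported files are skipped instead of raising an exception.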
To retrieve the text, use the Image and Layout objects. The sample reads the first image (page) of the document, and its Layout object gives access to the recognized text. The Words property of Layout is a collection whose Count property allows iteration through the list of words. Each word is retrieved through the indexer and appended to the result, with a blank space added between words.
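The word-joining step can also be isolated into a small helper. This is a minimal sketch, not part of the original sample: OcrText and JoinWords are hypothetical names, and it assumes the word strings have already been read from word.Text. Unlike the loop above, it skips empty strings so that no doubled or trailing spaces appear.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of the original sample).
public static class OcrText
{
    // Joins recognized words with single spaces, ignoring null or empty entries.
    public static string JoinWords(IEnumerable<string> words)
    {
        List<string> kept = new List<string>();
        foreach (string word in words)
        {
            if (!string.IsNullOrEmpty(word))
            {
                kept.Add(word);
            }
        }

        // string.Join with an array works on .NET Framework 3.5.
        return string.Join(" ", kept.ToArray());
    }
}
```

Inside OCRImplementation, each word.Text could be added to a List<string> and passed to OcrText.JoinWords, which avoids creating many intermediate strings with +=.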
The Close method of the md object takes a boolean argument indicating whether to save changes to the image file.
Using OpenXML SDK
In Solution Explorer, add a reference to the DocumentFormat.OpenXml library. This library lets us save the converted text as a Word document. A string constant defines the markup of the main document part (in this case WordprocessingML), with a #REPLACE# placeholder marking where the text is inserted; the SDK takes care of the package structure and relationships.
private const string PART_TEMPLATE =
    "<?xml version='1.0' encoding='UTF-8' standalone='yes'?>" +
    "<w:document xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>" +
    "<w:body><w:p><w:r><w:t>#REPLACE#</w:t></w:r></w:p></w:body></w:document>";
The CreateDocument method is responsible for inserting the text into the document structure.
private void CreateDocument(string Text)
{
    // Create the package at the chosen path; disposing it closes the file.
    using (WordprocessingDocument wordDoc =
        WordprocessingDocument.Create(txt_SavePath.Text,
            WordprocessingDocumentType.Document))
    {
        MainDocumentPart docPart = wordDoc.AddMainDocumentPart();

        // Escape XML special characters (such as & and <) so that
        // OCR output cannot break the markup.
        string partML = PART_TEMPLATE.Replace("#REPLACE#",
            System.Security.SecurityElement.Escape(Text));

        // Write the WordprocessingML into the main document part as UTF-8.
        using (Stream partStream = docPart.GetStream())
        {
            byte[] buffer = new UTF8Encoding().GetBytes(partML);
            partStream.Write(buffer, 0, buffer.Length);
        }
    }
}
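OCR output may contain characters such as & or <, which have to be escaped before they are placed inside WordprocessingML. The following sketch shows one way to build the part markup with escaping, writing one paragraph per line of text. WordMarkup and BuildDocumentXml are hypothetical names, and the one-paragraph-per-line rule is an assumption added for illustration; only the element structure comes from PART_TEMPLATE.

```csharp
using System;
using System.Security;
using System.Text;

// Hypothetical helper (not part of the original sample).
public static class WordMarkup
{
    private const string WordNamespace =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Builds the main document part markup, one <w:p> per line of input text.
    public static string BuildDocumentXml(string text)
    {
        StringBuilder xml = new StringBuilder();
        xml.Append("<?xml version='1.0' encoding='UTF-8' standalone='yes'?>");
        xml.Append("<w:document xmlns:w='").Append(WordNamespace).Append("'><w:body>");

        // Normalize Windows line endings, then split into lines.
        string[] lines = (text ?? string.Empty).Replace("\r\n", "\n").Split('\n');
        foreach (string line in lines)
        {
            xml.Append("<w:p><w:r><w:t>");
            // SecurityElement.Escape replaces &, <, >, " and ' with XML entities.
            xml.Append(SecurityElement.Escape(line));
            xml.Append("</w:t></w:r></w:p>");
        }

        xml.Append("</w:body></w:document>");
        return xml.ToString();
    }
}
```

The returned string can be written to the main document part in the same way CreateDocument writes partML.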
Speech Synthesis
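Add a reference to System.Speech on the .NET tab and import the System.Speech.Synthesis namespace. Then declare a SpeechSynthesizer field in the form class:
SpeechSynthesizer speaker = new SpeechSynthesizer();
After that, you just have to adjust the Volume and Rate properties (as done in the form constructor) and use the Speak method to speak a string.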
speaker.Speak("Searching");
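Speak is synchronous, so calling it from an event handler blocks the form until the phrase has finished. A possible alternative, sketched below, is SpeakAsync. The StatusSpeaker wrapper and its Clamp helper are hypothetical and not part of the original sample; the sketch assumes a reference to System.Speech and relies on Rate accepting values from -10 to 10 and Volume from 0 to 100.

```csharp
using System;
using System.Speech.Synthesis;

// Hypothetical wrapper (not part of the original sample).
public sealed class StatusSpeaker : IDisposable
{
    private readonly SpeechSynthesizer speaker = new SpeechSynthesizer();

    public StatusSpeaker(int rate, int volume)
    {
        // Out-of-range values would throw, so clamp them first.
        speaker.Rate = Clamp(rate, -10, 10);
        speaker.Volume = Clamp(volume, 0, 100);
    }

    // Speaks the message without blocking the caller.
    public void Announce(string message)
    {
        // Drop any phrase still queued so announcements do not pile up.
        speaker.SpeakAsyncCancelAll();
        speaker.SpeakAsync(message);
    }

    // Restricts value to the inclusive range [min, max].
    public static int Clamp(int value, int min, int max)
    {
        if (value < min) return min;
        if (value > max) return max;
        return value;
    }

    public void Dispose()
    {
        speaker.Dispose();
    }
}
```

With such a wrapper, the form could call Announce("Searching") before starting OCR and stay responsive while the phrase is spoken.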
Conclusion
Combining these APIs is an interesting idea, and the OCR code is very short compared with what third-party APIs usually require. MODI can be explored in many ways and, combined with the benefits of OpenXML and speech synthesis, it can improve your applications.