Introduction
The sample application scans a specified directory for images and, for each image it finds, reads the text from it using OCR. It automatically saves the text from each image in a text file with the same name as the image, and it handles exceptions raised by problem images.
If you have Office 2007 installed, the OCR component is available for you to use, although it may need to be enabled in Office setup. The only dependency this adds to your code is Office itself. Requiring Office (2007 or 2003) to be installed for your code to work may or may not suit your situation. But if your client can guarantee that the machines your code runs on have Office (2007 or 2003) installed, this solution is a good fit.
What is OCR?
OCR (Optical Character Recognition) is the recognition of printed or written text characters by a computer. This involves photoscanning of the text character-by-character, analysis of the scanned-in image, and then translation of the character image into character codes, such as ASCII, commonly used in data processing.
Put another way, optical character recognition translates images of text, such as scanned documents, into actual text characters. Also known as text recognition, OCR makes it possible to edit and reuse text that is normally locked inside scanned images. OCR uses a form of artificial intelligence known as pattern recognition to identify individual text characters on a page, including punctuation marks, spaces, and ends of lines.
What is Document Imaging?
Document imaging is the process of scanning paper documents and converting them to digital images that are then stored on CD, DVD, or other storage media. With Microsoft Office Document Imaging, you can scan paper documents and convert them to digital images that you can save to your computer's hard disk, a network server, a CD, or a DVD in either of two formats:
- Tagged Image File Format (TIFF): A high-resolution, tag-based graphics format. TIFF is used for the universal interchange of digital graphics.
- Microsoft Document Imaging Format (MDI): A high-resolution, tag-based graphics format, based on the Tagged Image File Format (TIFF) used for digital graphics.
Microsoft Office Document Imaging also gives you the ability to perform Optical Character Recognition (OCR), either as part of scanning a document or while you work with a TIFF or MDI file. After performing OCR, you can copy recognized text from a scanned image or a fax into a Microsoft Word document or another Office program file.
Weakness
To run an application that uses OCR, you must have the Office OCR component installed on the machine. Without it, your application will not work.
Strength
The component comes with Office, so you can use it in your code at no extra cost. It is easy to use because Microsoft provides many code samples showing how to use it.
Namespaces
using System;
using System.Collections;
using System.IO;
using System.Drawing.Imaging;
Using the Code
The name of the COM object that you need to add as a reference is Microsoft Office Document Imaging 12.0 Type Library. By default, Office 2007 doesn't install it. You'll need to make sure that it's added by using the Office 2007 installation program. Just run the installer, click on the Continue button with the "Add or Remove Features" selection made, and ensure that the imaging component is installed.
The OCR engine always defaults to the user's regional settings for the LangID argument, unless you specify the language explicitly when calling the OCR method; it does not retain the previously used setting. In a mixed-language environment, it is good practice to specify the LangID argument explicitly in every call to the OCR method.
So, create a Windows Application using C#. In the Visual Studio Solution Explorer, right-click References >> select Add Reference >> select the COM tab >> then select Microsoft Office Document Imaging 12.0 Type Library.
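public void CheckFileType(string directoryPath)
{
    foreach (string filePath in Directory.GetFiles(directoryPath))
    {
        // Match the extension regardless of case (.jpg, .JPG, .Jpg, ...).
        string fileExtension = Path.GetExtension(filePath);
        if (!string.Equals(fileExtension, ".jpg", StringComparison.OrdinalIgnoreCase))
        {
            continue;
        }

        MODI.Document md = new MODI.Document();
        try
        {
            md.Create(filePath);
            // Specify the language explicitly; the two flags ask the engine
            // to auto-orient and straighten the image before recognition.
            md.OCR(MODI.MiLANGUAGES.miLANG_ENGLISH, true, true);
            MODI.Image image = (MODI.Image)md.Images[0];

            // Save the text next to the image, e.g. scan.jpg -> scan.txt.
            // WriteAllText overwrites an existing file; FileMode.CreateNew
            // would throw if the .txt file were already there.
            File.WriteAllText(Path.ChangeExtension(filePath, ".txt"), image.Layout.Text);
        }
        catch (Exception exc)
        {
            // OCR can fail on unreadable images or images without text.
            // Report the problem and continue with the next file.
            System.Diagnostics.Trace.WriteLine("OCR failed for " + filePath + ": " + exc.Message);
        }
        finally
        {
            // Release the document without saving changes to the image.
            md.Close(false);
        }
    }
}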
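The file-name handling around the OCR call can be exercised on its own, without Office installed. The following is a minimal sketch, not part of the sample application: the class and method names (OcrFileNames, IsJpeg, TextPathFor) are hypothetical and chosen only for illustration. It matches the .jpg extension regardless of case and derives the output path with Path.ChangeExtension, which changes only the final extension of the file name.

```csharp
using System;
using System.IO;

// Hypothetical helper class for illustration; not part of MODI or the sample.
public static class OcrFileNames
{
    // Returns true when the path ends in .jpg, in any letter case.
    public static bool IsJpeg(string path)
    {
        string extension = Path.GetExtension(path);
        return string.Equals(extension, ".jpg", StringComparison.OrdinalIgnoreCase);
    }

    // Maps an image path to the text file written next to it,
    // for example scan.jpg -> scan.txt.
    public static string TextPathFor(string imagePath)
    {
        return Path.ChangeExtension(imagePath, ".txt");
    }
}
```

For a file named photo.jpg.backup.jpg, TextPathFor returns photo.jpg.backup.txt. Removing the extension with a plain string replacement of ".jpg" would also strip the ".jpg" in the middle of the name, which is why the sketch relies on Path.ChangeExtension.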
Points of Interest
I have built a larger sample application for Office OCR, and I'll release it soon.
Remark
Many people use OCR in Internet spiders to extract data from images.