OCR Documents in .NET

Dynamsoft

21 Jan 2013

This article is in the Product Showcase section for our sponsors at CodeProject. These articles are intended to provide you with information on products and services that we consider useful and of value to developers.

Overview

Dynamsoft’s OCR SDK is an add-on for Dynamic .NET TWAIN, an image acquisition SDK optimized for .NET applications. The OCR SDK lets you convert scanned images to searchable PDF or text files. OCR is a useful feature, but it is not easy to implement: recognition accuracy, input image format and other factors all affect the quality of the results, and OCR performance affects the efficiency of the whole process.

Dynamsoft’s OCR SDK, which is built on and optimized from the mature open source Tesseract OCR engine, relieves you of these burdens. By integrating it with Dynamic .NET TWAIN, you can create a robust image acquisition and processing solution in a few lines of source code.

Key Features

Supports more than 40 languages, including Arabic and various Asian languages.
High OCR performance through multi-threaded processing.
Accurate recognition with font identification.
Easy integration with the image acquisition SDK – Dynamic .NET TWAIN.

The following sections show how to integrate the OCR add-on into your WinForms application and convert scanned images to searchable PDF or text files.

Source Code

1. Embed Dynamic .NET TWAIN in your WinForms or WPF app.

We will take WinForms as an example.

This assumes you have already downloaded and installed the .NET component on your development machine. If not, please download the 30-day free trial from Dynamsoft’s website.

Open your WinForms app or create a new one in Visual Studio. From the Tools menu, select Choose Toolbox Items. In the dialog box that appears, click Browse and select DynamicDotNetTWAIN.dll, which can be found in the installation folder of Dynamic .NET TWAIN. Click OK to close the dialog box.

Drag the component from the Toolbox and drop it onto the form.

2. Scan images from scanners or webcams, or load them from local folders.

Dynamic .NET TWAIN supports getting images from various sources, including scanners, webcams and other TWAIN/WIA/UVC compatible devices. In this article, I’ll show you how to load existing images from your local disk.

SetViewMode: defines the view mode of the control.
LoadImage: loads existing local images. Supported image formats include BMP, PNG, JPEG, TIFF (both single and multi-page) and PDF (both single and multi-page).

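            // Use a 1 x 1 view so the control shows one image at a time.
            this.dynamicDotNetTwain1.SetViewMode(1, 1);
            OpenFileDialog filedlg = new OpenFileDialog();

            if (filedlg.ShowDialog() == DialogResult.OK)
            {
                // Load every file chosen in the dialog into the control's image buffer.
                foreach (string strfilename in filedlg.FileNames)
                {
                    this.dynamicDotNetTwain1.LoadImage(strfilename);
                }
            }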
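
As a possible refinement of the file dialog step, the dialog can be restricted to the formats that LoadImage is described as supporting. The sketch below builds a standard WinForms filter string and checks a path against the same list. The class name ImageFileFilter and the exact list of file extensions are assumptions made for illustration; only the format names (BMP, PNG, JPEG, TIFF, PDF) come from the text above.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper, not part of Dynamic .NET TWAIN.
public static class ImageFileFilter
{
    // Assumed file extensions for the formats listed for LoadImage.
    private static readonly string[] Extensions =
        { "bmp", "png", "jpg", "jpeg", "tif", "tiff", "pdf" };

    // Builds a filter string in the "description|patterns" form used by OpenFileDialog.Filter.
    public static string Build()
    {
        string patterns = string.Join(";", Extensions.Select(e => "*." + e));
        return "Supported files (" + patterns + ")|" + patterns;
    }

    // Returns true when the path ends in one of the assumed extensions (case-insensitive).
    public static bool IsSupported(string path)
    {
        string ext = Path.GetExtension(path).TrimStart('.').ToLowerInvariant();
        return Extensions.Contains(ext);
    }
}
```

With this helper, the dialog could be configured with filedlg.Filter = ImageFileFilter.Build(); and filedlg.Multiselect = true; before ShowDialog is called, so that several files can be picked at once.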

3. Initialize the OCR add-on and choose the language package.

1) Choose the language package and define the path of the package by using the OCRTessDataPath property.

Dynamsoft’s OCR SDK supports more than 40 languages, including English, Spanish, Arabic and more. The sample code below chooses English as the default language. Other language packages can be downloaded from Dynamsoft’s website: OCR SDK Language Packages

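            // The language package is expected in the folder that contains the application's EXE.
            string languageFolder = Application.StartupPath;

            this.dynamicDotNetTwain1.OCRTessDataPath = languageFolder;
            // "eng" selects the English language package.
            this.dynamicDotNetTwain1.OCRLanguage = "eng";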

2) Set the path of DynamicOCR.dll or DynamicOCRx64.dll to initialize the OCR add-on.

            this.dynamicDotNetTwain1.OCRDllPath = "";
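
Because the add-on ships as two DLLs, an application may want to decide at run time which one applies. The sketch below picks the file name from the bitness of the running process using Environment.Is64BitProcess (available from .NET Framework 4). The class name OcrDllLocator is hypothetical, and the choice to key on process bitness rather than OS bitness is an assumption. Whether OCRDllPath expects a folder or a full file path is not stated here, so the sketch only returns the file name.

```csharp
using System;

// Hypothetical helper, not part of Dynamic .NET TWAIN.
public static class OcrDllLocator
{
    // Returns the OCR DLL file name that matches the given process bitness.
    public static string GetDllFileName(bool is64BitProcess)
    {
        return is64BitProcess ? "DynamicOCRx64.dll" : "DynamicOCR.dll";
    }

    // Convenience overload that inspects the current process.
    public static string GetDllFileName()
    {
        return GetDllFileName(Environment.Is64BitProcess);
    }
}
```

Keying on the process rather than the operating system matters when a 32-bit build runs on 64-bit Windows, where a native DLL generally has to match the process that loads it.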

3) Choose the OCR result file format and save the result. Supported result formats are Text, PDF Plain Text and PDF Image over Text. With PDF Image over Text, the original image and text positions and formatting, such as font names, font sizes and line widths, are preserved.

            // Pick the result format from the drop-down list (index 0 is plain text).
            this.dynamicDotNetTwain1.OCRResultFormat = (Dynamsoft.DotNet.TWAIN.OCR.ResultFormat)this.ddlResultFormat.SelectedIndex;

            // Run OCR on the images currently selected in the buffer.
            byte[] sbytes = this.dynamicDotNetTwain1.OCR(this.dynamicDotNetTwain1.CurrentSelectedImageIndicesInBuffer);

            if (sbytes != null)
            {
                SaveFileDialog filedlg = new SaveFileDialog();
                if (this.ddlResultFormat.SelectedIndex != 0)
                {
                    filedlg.Filter = "PDF File (*.pdf)|*.pdf";
                }
                else
                {
                    filedlg.Filter = "Text File (*.txt)|*.txt";
                }
                if (filedlg.ShowDialog() == DialogResult.OK)
                {
                    // Overwrites any existing file with exactly these bytes (requires System.IO).
                    File.WriteAllBytes(filedlg.FileName, sbytes);
                }
            }
            else
            {
                // OCR failed: show the error reported by the component.
                MessageBox.Show(this.dynamicDotNetTwain1.ErrorString);
            }
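
One detail worth checking when an OCR result is saved over an existing file: File.OpenWrite opens the file without truncating it, so writing a shorter result over a longer file leaves stale bytes at the end, whereas File.WriteAllBytes replaces the contents. The sketch below makes the difference measurable. The class and method names are hypothetical, and only standard System.IO calls are used.

```csharp
using System;
using System.IO;

// Hypothetical demonstration class, not part of Dynamic .NET TWAIN.
public static class OverwriteDemo
{
    // Writes data from position 0 via File.OpenWrite and returns the resulting file length.
    // An existing file is NOT truncated, so the length never shrinks.
    public static long WriteWithOpenWrite(string path, byte[] data)
    {
        using (FileStream fs = File.OpenWrite(path))
        {
            fs.Write(data, 0, data.Length);
        }
        return new FileInfo(path).Length;
    }

    // Replaces the file contents via File.WriteAllBytes and returns the resulting file length.
    public static long WriteWithWriteAllBytes(string path, byte[] data)
    {
        File.WriteAllBytes(path, data);
        return new FileInfo(path).Length;
    }
}
```

For a file that already holds 10 bytes, WriteWithOpenWrite with a 4-byte array returns 10, while WriteWithWriteAllBytes with the same array returns 4.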

Distribution

To distribute the application to end users, copy the following files to the client machine along with the EXE file.

The language package
DynamicOCR.dll (for 32-bit Windows OS) and/or DynamicOCRx64.dll (for 64-bit Windows OS)
DynamicDotNetTWAIN.dll

Xcopy deployment is also supported.
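
Since a missing file typically only shows up as a runtime error on the client machine, a startup check can report it early. The sketch below lists which required files are absent from a folder. The class name DeploymentCheck is hypothetical, and the caller supplies the file names, because the file name of the language package is not given in this article.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of Dynamic .NET TWAIN.
public static class DeploymentCheck
{
    // Returns the names from requiredFiles that do not exist in the given folder.
    public static List<string> FindMissing(string folder, IEnumerable<string> requiredFiles)
    {
        var missing = new List<string>();
        foreach (string name in requiredFiles)
        {
            if (!File.Exists(Path.Combine(folder, name)))
            {
                missing.Add(name);
            }
        }
        return missing;
    }
}
```

A typical call would pass AppDomain.CurrentDomain.BaseDirectory as the folder, together with DynamicDotNetTWAIN.dll, the OCR DLL for the target platform and the language package file, and then show a message if the returned list is not empty.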

Resources

The complete source code of the OCR sample can be downloaded from this article. To test or customize the code, you can download the trial version of Dynamic .NET TWAIN from Dynamsoft’s website.

Download Dynamic .NET TWAIN 30-Day Free Trial

Other demos/samples of .NET image acquisition and processing can be found here:

Dynamic .NET TWAIN Demos

If you have any questions, you can contact our support team at nettwain@dynamsoft.com.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.