Introduction
Dynamsoft’s Dynamic Web TWAIN SDK is a solution for web-based image
processing that lets developers ignore the low-level details and focus on
their application. Dynamic Web TWAIN 8.0, released in July 2012, introduced
two add-ons: OCR (Optical Character Recognition) and Barcode Recognition.
This document focuses on the OCR add-on, which lets developers use this
technology without worrying about the low-level implementation, while
retaining flexibility.
The web version of the SDK is controlled by JavaScript, a
scripting language that almost all web developers are familiar with. An
extensive library of functions and properties gives developers full
control and lets them implement their own interface easily. All popular
browsers are supported: the underlying Web TWAIN software is provided as an
ActiveX control as well as a browser plugin. Users of Internet Explorer can use
the ActiveX version, and users of other browsers, such as Firefox, Chrome,
Safari, and Opera, can use the plugin version. For those not using the Windows
operating system, a Mac plugin edition is also available.
Below, we’ll make use of Dynamsoft’s new Dynamic Web TWAIN
OCR add-on to extract text from a scanned image. All three output modes will be
demonstrated: Plain Text, Plain Text PDF, and Image over Text PDF. If the
Dynamic Web TWAIN SDK is not yet installed on your system, you can easily
download a trial, or view and use the online demo on the Dynamsoft
website.
What Dynamic OCR Supports
- Over 40 languages, including Arabic and various Asian languages.
- All the common file formats: jpg, gif, png, bmp, tiff, and more.
- Multiple-page document processing.
- Hand-written and printed characters.
- Font name and size recognition.
- Detailed positioning and format information.
- PDF output maintaining the look of the original document.
- Integration with Dynamic Web TWAIN, so images can be edited before
OCR is performed.
How to Use OCR in Your Application
The following code samples are all provided in JavaScript,
under the assumption that a WebTWAIN object has already been created with the
variable name WebTWAIN.
Understanding OCR Settings
Before OCR begins, a number of settings can be changed which
affect the output. While sensible defaults are provided, there are many
situations where it may be appropriate to change settings.
OCRDllPath: This needs to be set to the path where the OCR DLL file is
located. By default, this is set to the current working directory.
OCRTessDataPath: This needs to be set to the root path for the application,
where the tessdata folder for the language you wish to use is located.
OCRLanguage: By default, this is set to "eng". This refers to the prefix of
the language files you wish to use.
OCRResultFormat: An important setting that determines whether the OCR results
are saved to a text file or a PDF file.
OCRUseDetectedFont: Determines whether or not OCR should try to use the
detected font faces when outputting PDF documents. The detected fonts need to
exist on the system for them to be used, so use this option with caution.
OCRUnicodeFontName: A backup font to be used when OCRUseDetectedFont is off.
This font must also exist on the system. If this is not provided, the library
will attempt to use an appropriate font for the language used.
OCRPageSetMode: This affects the way the page formatting is determined by OCR.
By default it is set to automatic, which is fine for most operations.
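As an illustration of how these settings might be applied together, the sketch below copies a plain settings object onto a control and reports any names it does not recognise. The property names are the ones listed above; the helper applyOcrSettings and the stand-in object fakeControl are hypothetical and are not part of the SDK.

```javascript
// Setting names as listed in this document. The helper itself is hypothetical.
const OCR_SETTING_NAMES = [
  "OCRDllPath",
  "OCRTessDataPath",
  "OCRLanguage",
  "OCRResultFormat",
  "OCRUseDetectedFont",
  "OCRUnicodeFontName",
  "OCRPageSetMode",
];

// Copy only recognised OCR settings onto the control and return the names
// that were skipped, so a misspelled setting does not pass unnoticed.
function applyOcrSettings(control, settings) {
  const ignored = [];
  for (const name of Object.keys(settings)) {
    if (OCR_SETTING_NAMES.includes(name)) {
      control[name] = settings[name];
    } else {
      ignored.push(name);
    }
  }
  return ignored;
}

// Stand-in for the real WebTWAIN object, used here only for illustration.
const fakeControl = {};
const ignoredNames = applyOcrSettings(fakeControl, {
  OCRLanguage: "eng",
  OCRResultFormat: 0,
  OCRLangauge: "chi_sim", // deliberate typo: reported, not applied
});
console.log(fakeControl.OCRLanguage, ignoredNames);
```

In a real page the first argument would be the WebTWAIN object itself. Settings that are left out keep their defaults.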
OCR on an Image in the TWAIN Buffer
In this example, we perform English OCR on an image selected by its index
among the images that have been loaded into the TWAIN buffer. The results are
saved to a text file called OCRResults.txt. Since we’re only producing
text output, we don’t need to worry about any font settings, and the defaults
are used for all the other settings, such as OCRDllPath and OCRTessDataPath.
function DoOCR(imageIndex) {
    WebTWAIN.OCRLanguage = "eng";
    WebTWAIN.OCRResultFormat = 0; // plain text output
    return WebTWAIN.OCR(imageIndex, "OCRResults.txt");
}
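The same call could be repeated for several images in the buffer, writing one text file per image. The following is a minimal sketch under stated assumptions: the caller supplies the number of images, the helper ocrAllImages and the file naming scheme are hypothetical, and recordingControl is a stand-in that only records calls and performs no OCR.

```javascript
// Run OCR on images 0..imageCount-1, writing one text file per image.
// `control` is expected to expose OCRLanguage, OCRResultFormat and OCR(),
// as in the sample above. `imageCount` is supplied by the caller.
function ocrAllImages(control, imageCount, baseName) {
  control.OCRLanguage = "eng";
  control.OCRResultFormat = 0; // plain text output
  const results = [];
  for (let index = 0; index < imageCount; index++) {
    const fileName = baseName + "_" + index + ".txt";
    results.push({
      fileName: fileName,
      returned: control.OCR(index, fileName),
    });
  }
  return results;
}

// Stand-in control that only records its calls.
const recordingControl = {
  calls: [],
  OCR: function (index, fileName) {
    this.calls.push([index, fileName]);
    return true; // placeholder: the real return value is not specified here
  },
};

const batch = ocrAllImages(recordingControl, 3, "OCRResults");
console.log(batch.map(function (entry) { return entry.fileName; }));
```

Giving each image its own file name avoids every call writing to the same OCRResults.txt.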
OCR on an External Image
The OCR functions are very flexible, and also allow for
images to be loaded directly from multiple files, with paths relative to the
working directory. In this sample, we will also use a different setting for
OCRTessDataPath, and use Simplified Chinese for the language.
function ChineseOCR() {
    WebTWAIN.OCRLanguage = "chi_sim";
    WebTWAIN.OCRResultFormat = 0; // plain text output
    WebTWAIN.OCRTessDataPath = "../../tesseract-chinese/";
    return WebTWAIN.OCRDirectly("wendang1.tif|wendang2.tif|wendang3.tif", "ChineseOCR.txt");
}
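In the sample above, the image list is passed as a single string with the paths separated by "|". When the paths come from an array, a small helper can assemble that string. The helper joinImagePaths and the file names below are hypothetical, and rejecting paths that contain "|" is an assumption made to keep the list unambiguous.

```javascript
// Join image paths into the "a|b|c" form shown in the sample above.
function joinImagePaths(paths) {
  if (paths.length === 0) {
    throw new Error("at least one image path is required");
  }
  for (const path of paths) {
    if (path.includes("|")) {
      throw new Error('path must not contain "|": ' + path);
    }
  }
  return paths.join("|");
}

const imageList = joinImagePaths(["page1.tif", "page2.tif", "page3.tif"]);
console.log(imageList); // prints "page1.tif|page2.tif|page3.tif"
```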
Using Chinese for the language demonstrates the Unicode output
capabilities of Dynamic OCR. The resulting text is encoded in UTF-8,
so the file should be opened in an editor that supports UTF-8.
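To see why the encoding matters, the short example below encodes a two-character Chinese string as UTF-8 and decodes it again. It uses the standard TextEncoder and TextDecoder interfaces found in current browsers and Node.js, and is independent of the SDK. The sample string is arbitrary.

```javascript
const sample = "文档"; // "document" in Simplified Chinese

// TextEncoder always produces UTF-8 bytes.
const utf8Bytes = new TextEncoder().encode(sample);

// Decoding with the matching encoding restores the original text.
const decoded = new TextDecoder("utf-8").decode(utf8Bytes);

console.log(sample.length);      // 2 characters
console.log(utf8Bytes.length);   // 6 bytes, three per character
console.log(decoded === sample); // true
```

Each of these characters occupies three bytes, so a tool that assumes a single-byte encoding will show the file as garbled text.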
OCR to a Plain-text PDF
Dynamic OCR also supports output directly to PDF files. The
simplest way to do this is to output text only, which is well suited to
documents and scans that contain primarily text. The screenshots below show
part of the output produced by the following source code.
function OCRToPDF() {
    WebTWAIN.OCRLanguage = "eng";
    WebTWAIN.OCRResultFormat = 1; // plain-text PDF
    return WebTWAIN.OCRDirectly("Demo_OCR1.png", "Plain.pdf");
}
Original image:
Plain-text PDF:
As the screenshots show, the plain-text PDF preserves the text and its
positioning. However, the colour, italics, and images are lost. In many
cases this is acceptable, or even the desired result. Where the images are
important, the Image-over-Text option should be used instead.
OCR to an Image-over-Text PDF
Image-over-Text PDFs maintain the original look of the
document, but add the ability to select, copy, and search text. They are ideal
for scans of complex tables, books, or other documents that contain images and
complicated formatting. Below is an example of the same code, except with
Image-over-Text as the OCRResultFormat.
function OCRToPDF() {
    WebTWAIN.OCRLanguage = "eng";
    WebTWAIN.OCRResultFormat = 2; // Image-over-Text PDF
    return WebTWAIN.OCRDirectly("Demo_OCR1.png", "ImageOverText.pdf");
}
The above image is a screenshot of the resulting PDF, with some of the
text selected. As you can see, the text selection is accurate, and the
OCR results can be copied or searched just as in a text document.
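The samples so far set OCRResultFormat to 0, 1, and 2 for Plain Text, Plain Text PDF, and Image over Text PDF respectively. Giving those numbers names can make calling code easier to read. The constant names and the outputExtension helper below are hypothetical and not part of the SDK. Only the numeric values come from the samples above.

```javascript
// Names for the three OCRResultFormat values used in this document.
const OCR_RESULT_FORMAT = Object.freeze({
  PLAIN_TEXT: 0,
  PLAIN_TEXT_PDF: 1,
  IMAGE_OVER_TEXT_PDF: 2,
});

// Pick a file extension that matches the chosen output format.
function outputExtension(format) {
  switch (format) {
    case OCR_RESULT_FORMAT.PLAIN_TEXT:
      return ".txt";
    case OCR_RESULT_FORMAT.PLAIN_TEXT_PDF:
    case OCR_RESULT_FORMAT.IMAGE_OVER_TEXT_PDF:
      return ".pdf";
    default:
      throw new Error("unknown OCRResultFormat: " + format);
  }
}

const outputName = "Scan" + outputExtension(OCR_RESULT_FORMAT.IMAGE_OVER_TEXT_PDF);
console.log(outputName); // prints "Scan.pdf"
```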
OCR to a String in Memory
Of course, the results of OCR can also be saved in memory,
whether in the form of plain text or a PDF. In this sample, the results are
saved to a plain text string. A string could also hold Base64-encoded PDF
results if the OCRResultFormat was not set to 0. After the results are saved,
they are written to the page with document.write.
function OCRToString(imageIndex) {
    WebTWAIN.OCRLanguage = "eng";
    WebTWAIN.OCRResultFormat = 0; // plain text output
    return WebTWAIN.OCREx(imageIndex);
}
var results = OCRToString(0);
document.write(results);
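Because document.write treats its argument as HTML, recognised text that contains characters such as "<" or "&" may not display as written. One way to handle this is to escape the text first. For Base64-encoded PDF results, one option is to wrap the string in a data URI that a link can point to. Both helpers below, escapeHtml and toPdfDataUri, are hypothetical sketches and are not part of the SDK.

```javascript
// Escape characters that have a special meaning in HTML so that
// recognised text is displayed literally.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Wrap a Base64-encoded PDF in a data URI. No validation is performed.
function toPdfDataUri(base64Pdf) {
  return "data:application/pdf;base64," + base64Pdf;
}

const safeText = escapeHtml("if a < b && c > d");
console.log(safeText); // prints "if a &lt; b &amp;&amp; c &gt; d"

// "JVBERi0xLjQ=" is the Base64 encoding of the text "%PDF-1.4".
const pdfUri = toPdfDataUri("JVBERi0xLjQ=");
console.log(pdfUri);
```

In the sample above, this would mean calling document.write(escapeHtml(results)) instead of writing the raw string.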
Download the Sample
To try out the features described above, you can go to the online demo at:
Dynamic Web TWAIN OCR Barcode Online Demo
If you’d like to evaluate Dynamic Web TWAIN, which includes the OCR add-on, you can download the free trial here:
Dynamic Web TWAIN 30-day Free Trial.
If you have any questions, you can contact our support team at twainsupport@dynamsoft.com.