Introduction
OCR combined with a powerful approximate regular expression engine can capture and search data from text on images that would otherwise be lost. Even in today’s digital age, many companies still rely on paper documents. To bridge the gap, Optical Character Recognition (OCR) captures the data on those paper documents and brings it into the digital workspace. OCR is useful on its own in many situations, and you can build even more powerful solutions by adding regular expression search with approximate matching. Searchable document creation, capturing bank check amounts, extracting dollar amounts from invoices, redacting sensitive data, and indexing documents for subsequent search are just a few typical uses for OCR and regular expression search.
In this article, we review some of the existing problems where this technology can provide a solution. We also give an overview of the technology used to create solutions for these problems. Finally, we demonstrate the power of this combined technology by implementing one of the use cases. The associated sample code and a trial download of Pegasus Imaging’s full-page OCR SDK can be found here.
The following use cases are common examples of where OCR is used.
Searchable Document Creation
When documents exist as images, either as digital fax or as scanned documents, they are not in a format that is easy to search. OCR converts the image of text into actual searchable text. You can combine this text with the original image in PDF files or XPS files. This is useful if you need to preserve the original image for legal reasons, such as when a signature is present on the image, but you also need to search the text. Google Desktop and Windows Desktop Search will index these OCR-created PDF files and XPS files, allowing you to find desired documents through routine text searches. Full-page OCR solutions, such as OCR Xpress, are best suited for this use.
Forms Processing
Insurance forms, entrance exams, tax returns, invoices, and checks are documents that many businesses process on a daily basis. Some businesses receive thousands, possibly even millions, of these documents every day. Forms processing is an automated way to process these documents. Most forms processing solutions use OCR to gather machine-printed data, Intelligent Character Recognition (ICR) to gather handwritten data, and Optical Mark Recognition (OMR) to detect filled-in check boxes or bubbles. Structured forms processing typically uses zonal OCR and ICR, such as SmartZone v2, to collect data from form fields. Semi-structured and unstructured forms processing use either zonal or full-page OCR, depending on the implementation.
Redaction of Sensitive Data
Redacting sensitive data from images is another important use of OCR. With the continued concern about privacy, the requirement to redact social security numbers, birth dates, and other sensitive data from images is becoming more common. Businesses and government organizations frequently publish customer-submitted document images on web sites. The organizations that collect these documents must remove or redact the sensitive data in them before publishing. Recent privacy legislation makes this a requirement for many types of document images. In our example below, we will develop a simple search and redaction program to demonstrate the combined power of an accurate OCR engine with an approximate regular expression engine.
Limitations of Technology
Many OCR engines today approach or exceed 99% accuracy. In many use cases, this is sufficient for the problem being addressed. However, some applications require higher accuracy. There are a number of ways to improve the recognition accuracy of an OCR engine. Starting with clean images is one. Using the best available technology when converting color or grayscale images to black and white (binary) can also improve recognition accuracy. Starting with higher-resolution images, 300 DPI or higher, helps the recognition process. Using multiple OCR engines and comparing their results can reduce recognition errors as well.
Unfortunately, not all of these options may be possible. The images may have originated outside of the control of the organization, resulting in poorly acquired images with tears, or speckles, or dark, low-resolution images. Some image cleanup can help, but may not be enough to get the OCR engine to 100% accuracy, despite claims to the contrary.
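To see why 99% accuracy can still fall short, consider a rough back-of-the-envelope calculation. The Python sketch below assumes that every character is misread independently with the same probability, which real OCR errors do not strictly obey (errors tend to cluster on damaged areas). The field length and page size are illustrative values, not measurements.

```python
def chance_all_correct(char_accuracy, length):
    """Probability that `length` characters are all read correctly,
    assuming independent per-character errors (a simplification)."""
    return char_accuracy ** length

# An 11-character field, such as a social security number "123-45-6789".
ssn = chance_all_correct(0.99, 11)

# Expected number of misread characters on a 2,000-character page.
expected_errors = 2000 * (1 - 0.99)

print(f"{ssn:.3f}")            # roughly 0.895
print(round(expected_errors))  # 20
```

Under these assumptions, about one in ten such fields contains at least one misread character, which is enough to make an exact search miss it.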
Overcoming the Limitations
When the solution involves searching for text patterns and 100% OCR accuracy cannot be guaranteed, another technology is required to help improve search results. Approximate regular expression search engines help improve search results.
What is an approximate regular expression search engine?
Regular expressions allow users to define patterns used to search for particular text in strings. If you have ever typed “dir *.c” at a command line, you have used wildcards, a simpler relative of regular expressions. The best way to understand regular expressions is through an example. You can use the pattern “\d\d\d” to search for any three digits in a row. This regular expression applied against the string “abc 123” will match the “123”. The regular expression engine returns an index into the input string to indicate where it found the match. In this example, the index is 4 (zero-based).
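The three-digit example can be tried in almost any regular expression engine. The sketch below uses Python's standard re module purely for illustration; it is not the engine built into OCR Xpress.

```python
import re

# Search for three consecutive digits anywhere in the string.
match = re.search(r"\d\d\d", "abc 123")

print(match.group())  # 123
print(match.start())  # 4 (zero-based index of the match)
```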
Approximate regular expressions extend the regular expression functionality by allowing errors in the match in the form of insertions, deletions and substitutions of characters in the string and still match the pattern.
If the OCR engine misreads the string and returns “abc 1Z3”, where the 2 is replaced with the letter Z, approximate regular expressions can still match the “\d\d\d” pattern when substitutions are allowed. Treating the ‘Z’ as a substitution for a ‘2’, or any other digit, allows the pattern to match “1Z3”. If the OCR engine inserts text into the string, for example “abc 1i23”, then with insertions allowed the pattern still matches “1i23”. And with deletions allowed against the string “abc 12”, the pattern matches “12”.
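One common way to implement this kind of matching is a dynamic-programming edit-distance search (often called Sellers' algorithm). The article does not say how OCR Xpress does it internally, so the Python sketch below is only an illustration of the idea. It simplifies the pattern to a list of per-character tests, which covers “\d\d\d” but not general regular expressions, and the name min_errors is hypothetical.

```python
def min_errors(pattern, text):
    """Fewest insertions, deletions, and substitutions needed for
    `pattern` to match some substring of `text`.

    `pattern` is a list of predicates, one per expected character.
    """
    m = len(pattern)
    # prev[i]: cost of matching the first i pattern elements against a
    # substring ending at the previous text position.
    prev = list(range(m + 1))
    best = prev[m]
    for ch in text:
        cur = [0]  # a match may start at any position in the text
        for i in range(1, m + 1):
            substitution = prev[i - 1] + (0 if pattern[i - 1](ch) else 1)
            insertion = prev[i] + 1    # extra character in the text
            deletion = cur[i - 1] + 1  # expected character is missing
            cur.append(min(substitution, insertion, deletion))
        best = min(best, cur[m])
        prev = cur
    return best

three_digits = [str.isdigit] * 3  # stands in for the pattern \d\d\d
for sample in ["abc 123", "abc 1Z3", "abc 1i23", "abc 12"]:
    print(sample, "->", min_errors(three_digits, sample))
```

The exact string needs zero errors, while each of the three damaged strings needs one, so all of them match once a single error is allowed.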
Example Implementation: Searching Images for Information
For this example, we’ll first need to download the OCR Xpress v2 SDK, which also contains ImagXpress v9. Next, we will load an image that contains text as input using ImagXpress. Using the OCR Xpress engine, we will recognize the text on that image. We will then search the recognized text for a regular expression pattern using the approximate regular expression engine built into OCR Xpress v2. Next, we will highlight the text on the screen using NotateXpress v9 (also included in the OCR Xpress SDK). Finally, we export to a searchable, image-over-text PDF, with the matched text redacted from the image and removed from the searchable text.
Recognition of Text
The first step in creating this application is to load the image and perform recognition on the loaded image. There are a few other maintenance steps shown in the code below. The toolkits make the whole process straightforward.
    ImageX documentImage = ImageX.FromStream(m_imagXpress, openFileDialog.OpenFile());

    // Convert to a System.Drawing.Bitmap for OCR Xpress. We will need to dispose
    // of this bitmap promptly, so wrap it in a using block.
    using (System.Drawing.Bitmap theImage = documentImage.ToBitmap(false))
    {
        m_ocrXpressPage = m_ocrXpress.Document.AddPage(theImage);
    }

    // Straighten the page and recognize the text we are going to search.
    m_ocrXpress.Document.AutoRotate(m_ocrXpressPage);
    m_ocrXpress.Document.Deskew(m_ocrXpressPage);
    m_ocrXpress.Document.Recognize(m_ocrXpressPage);

    // Display the bitonal image if OCR Xpress produced one; otherwise display
    // the page bitmap.
    if (m_ocrXpressPage.BitonalBitmap == null)
        m_imageXView.Image = ImageX.FromBitmap(m_imagXpress, m_ocrXpressPage.Bitmap);
    else
        m_imageXView.Image = ImageX.FromBitmap(m_imagXpress, m_ocrXpressPage.BitonalBitmap);
Set up the Pattern for Search
Once the image has been loaded and recognized, the user can enter a search string, or choose from a predefined search pattern, such as a phone number. The code below shows how to set up the pattern in OCR Xpress and then perform the matching. If the user of the application chooses the approximate matching, we set up the structure to allow a total of two errors, which can be a combination of zero or one substitution, zero to two deletions and up to two insertions.
    using (PatternMatcher search = new PatternMatcher(m_ocrXpress))
    {
        List<MatchResult> searchResults;
        search.Pattern = txtSearchPattern.Text;
        if (chkMatchApproximate.CheckState == CheckState.Checked)
        {
            search.MaximumInsertions = 2;
            search.MaximumDeletions = 2;
            search.MaximumSubstitutions = 1;
            search.MaximumErrors = 2;
        }
        search.CaseSensitive = chkCaseSensitive.Checked;
        searchResults = search.PerformMatching(m_ocrXpressPage);
    }
Note that in this example image, one occurrence of the word “OCR” was damaged with an ink blot (via a paint program), causing the “O” to look like a “Q”. Standard regular expression engines would not match this text when searching for “OCR Xpress”, but when we turn on approximate matching, the search does find this occurrence, as well as several other occurrences where the space between “OCR” and “Xpress” is eliminated.
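To make the error budget concrete, the Python sketch below enumerates every mix of insertions, deletions, and substitutions that fits within the limits set in the listing (at most two insertions, two deletions, one substitution, and two errors in total). It assumes the per-type limits and the overall limit combine the way the text describes; it does not call OCR Xpress.

```python
from itertools import product

MAX_INSERTIONS, MAX_DELETIONS, MAX_SUBSTITUTIONS, MAX_ERRORS = 2, 2, 1, 2

# Every (insertions, deletions, substitutions) mix within the budget.
allowed = [
    (ins, dele, sub)
    for ins, dele, sub in product(range(MAX_INSERTIONS + 1),
                                  range(MAX_DELETIONS + 1),
                                  range(MAX_SUBSTITUTIONS + 1))
    if ins + dele + sub <= MAX_ERRORS
]

for combo in allowed:
    print("insertions=%d deletions=%d substitutions=%d" % combo)
```

In the example image, the damaged “QCR Xpress” costs one substitution and “OCRXpress” costs one deletion, so both fall inside this budget.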
Display the Results In the List Box
To display the results, we built an array of match results in a System.Collections.ArrayList and tied it to the Windows.Forms.ListBox control for the display. We populated the ArrayList, listBoxItems, with a fragment of the text line that includes the match.
    PageResult page = m_ocrXpressPage.GetResult();
    foreach (MatchResult result in searchResults)
    {
        TextBlockResult block = page.GetTextBlockResult(result.TextBlockIndex);

        // Start collecting the text to display for this match.
        if (result.TextLineStartIndex == result.TextLineEndIndex)
        {
            // In this case, the line where the match starts is also the
            // line where the match ends.
            TextLineResult line = block.GetTextLineResult(result.TextLineStartIndex);

            // One word of context before the match.
            int wordsBeforeIndex = GetStartIndexOfWordsBefore(line.Text, result.CharStartIndex, 1);
            string itemString = line.Text.Substring(wordsBeforeIndex,
                result.CharStartIndex - wordsBeforeIndex);

            // The match itself, wrapped in brackets.
            itemString += "[";
            itemString += line.Text.Substring(
                result.CharStartIndex, result.CharEndIndex - result.CharStartIndex);
            itemString += "]";

            // Up to eight words of context after the match.
            int wordsAfterIndex = GetEndIndexOfWordsAfter(line.Text, result.CharEndIndex, 8);
            itemString += line.Text.Substring(result.CharEndIndex,
                wordsAfterIndex - result.CharEndIndex);
        }
        else
        {
            // The match spans more than one text line (not shown here).
        }
    }
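The listing calls two helpers, GetStartIndexOfWordsBefore and GetEndIndexOfWordsAfter, that belong to the sample application and are not shown in this article. The Python sketch below is a guess at their behavior, inferred only from how they are called; the function names and the whitespace-based definition of a word are assumptions.

```python
def start_index_of_words_before(text, index, count):
    """Index where the `count` words preceding text[index] begin."""
    i = index
    for _ in range(count):
        # Step left over whitespace, then over the word itself.
        while i > 0 and text[i - 1].isspace():
            i -= 1
        while i > 0 and not text[i - 1].isspace():
            i -= 1
    return i

def end_index_of_words_after(text, index, count):
    """Index just past the `count` words that follow text[index]."""
    i, n = index, len(text)
    for _ in range(count):
        # Step right over whitespace, then over the word itself.
        while i < n and text[i].isspace():
            i += 1
        while i < n and not text[i].isspace():
            i += 1
    return i

def context_fragment(text, start, end, before=1, after=8):
    """Bracket the match text[start:end] and add surrounding words."""
    left = start_index_of_words_before(text, start, before)
    right = end_index_of_words_after(text, end, after)
    return text[left:start] + "[" + text[start:end] + "]" + text[end:right]

line = "Try the OCR Xpress toolkit today"
print(context_fragment(line, 8, 18, before=1, after=1))
# the [OCR Xpress] toolkit
```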
Highlight on the Screen
An event from the ListBox calls a function that uses NotateXpress to highlight the text on the image. OCR Xpress provides the coordinates of the characters in the image.
    // Add a rectangle annotation highlighting the matched text.
    if (result.TextLineStartIndex == result.TextLineEndIndex)
    {
        if (result.CharStartIndex == result.CharEndIndex)
        {
            // Nothing to highlight: this is an empty result, such as a
            // match for the pattern "$".
            return;
        }

        rectAnnotation = new RectangleTool();
        rectAnnotation.BackStyle = BackStyle.Translucent;
        rectAnnotation.FillColor = fillColor;
        rectAnnotation.Moveable = rectAnnotation.Sizeable = false;

        // In this case, the line where the match starts is also the line
        // where the match ends.
        textLine = textBlock.GetTextLineResult(result.TextLineStartIndex);

        // Find the first character of the match that is not a space.
        int i1, i2;
        for (i1 = result.CharStartIndex; i1 < result.CharEndIndex; i1++)
        {
            firstCharacterResult = textLine.GetCharacter(i1);
            if (firstCharacterResult.Text != " ")
                break;
        }

        // Find the last character of the match that is not a space.
        for (i2 = result.CharEndIndex - 1; i2 >= i1; i2--)
        {
            lastCharacterResult = textLine.GetCharacter(i2);
            if (lastCharacterResult.Text != " ")
                break;
        }

        // Build the bounding rectangle of the result: use the text line's
        // area for the vertical extent and the first and last characters
        // to set the horizontal extent.
        System.Drawing.Rectangle boundingRectangle = new System.Drawing.Rectangle();
        boundingRectangle.Y = textLine.Area.Y;
        boundingRectangle.Height = textLine.Area.Height;
        boundingRectangle.X = firstCharacterResult.Area.X;
        boundingRectangle.Width = (lastCharacterResult.Area.Width +
            lastCharacterResult.Area.X) - firstCharacterResult.Area.X;
        rectAnnotation.BoundingRectangle = boundingRectangle;
        layer.Elements.Add(rectAnnotation);
    }
After the image is highlighted, we adjust the scroll position so the highlighted text is on the screen.
    int xVis = rectAnnotation.BoundingRectangle.X +
        rectAnnotation.BoundingRectangle.Width / 2;
    int yVis = rectAnnotation.BoundingRectangle.Y +
        rectAnnotation.BoundingRectangle.Height / 2;
    double xOffset = xVis * m_imageXView.ZoomFactor - m_imageXView.Width / 2;
    double yOffset = yVis * m_imageXView.ZoomFactor - m_imageXView.Height / 2;
    m_imageXView.ScrollPosition = new Point((int)xOffset, (int)yOffset);
Redact and Export
Finally, if the user is happy with the search results, they can redact the text that was matched and export the redacted text to a searchable PDF. We use NotateXpress to brand the redactions into the image, and then replace that image in OCR Xpress just prior to export. The underlying text is also redacted by replacing the offending text with “X”s, while the rest of the text is still searchable.
    List<MatchResult> results = new List<MatchResult>();
    foreach (ResultListBoxItem item in listBoxItems)
    {
        results.Add((MatchResult)item.ValueObject);
    }
    using (ImageXView imageXView = new ImageXView(m_imagXpress))
    {
        imageXView.Image = m_imageXView.Image.Copy();
        RedactSearchResultsOnImage(results, imageXView);
        using (System.Drawing.Bitmap redactedImage = imageXView.Image.ToBitmap(false))
        {
            m_ocrXpressPage.Bitmap = redactedImage;
        }
        RedactSearchResultsInCurrentPage(results);
        m_ocrXpress.Document.Export(ExportFormat.PdfImageWithText,
            saveFileDialog.FileName);
    }
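The helper RedactSearchResultsInCurrentPage is part of the sample application and is not shown here. As a rough illustration of the text side of redaction, the Python sketch below overwrites each matched span with “X” characters. The function name redact_text, the sample line, and the end-exclusive (start, end) spans are assumptions made for this sketch, not the sample's actual code.

```python
def redact_text(text, spans, mask="X"):
    """Overwrite each end-exclusive (start, end) span with mask characters.

    The result has the same length as the input, so character positions
    elsewhere in the line are unchanged.
    """
    chars = list(text)
    for start, end in spans:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = mask
    return "".join(chars)

line = "SSN: 123-45-6789 on file"
print(redact_text(line, [(5, 16)]))  # SSN: XXXXXXXXXXX on file
```

Because only the matched characters are overwritten, the rest of the line remains searchable, as described above.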
Conclusion
Developers can create powerful image search solutions when they combine accurate OCR with approximate regular expression engines. This technology can solve a number of common problems that are pervasive across industries. The simple OCR and search example that we created here demonstrates how the OCR Xpress and ImagXpress SDKs make such business solutions easy to create.
You can find Pegasus Imaging product downloads and features at www.pegasusimaging.com. Please contact us at sales@jpg.com or support@jpg.com for more information.