Recognition technologies are bespoke solutions, which means they mostly live behind the scenes. But without them, some of our favorite and most useful modern applications would provide no value at all: the friendly voice assistant on your phone (speech recognition), the search capabilities in most productivity tools (optical character recognition, or OCR), and the automated organization of content in your favorite online store (natural language processing, or NLP). In this post I will explain how quality imaging is the most critical element of an application I’m working on that relies on both OCR and NLP, and how I’m using the MobileImage SDK to get it.
Because of my deep background in recognition technologies, one of my clients approached me for advice on the best way to leverage imaging, OCR, and NLP in their rapid record collection application. They needed the ability, via a mobile application, to quickly capture images, get OCR results that are as accurate as possible, and then logically group documents by the entities returned from an NLP engine.
The end goal is to improve navigation of the data based on critical keywords, without manual organization. Manual organization is not an option because the data goes stale very quickly (within hours), so the faster it can be consumed the better.
Why imaging matters - and why not all imaging is the same
I’m sure you have heard the phrase "garbage in, garbage out". Cliché as it is, with these technologies the earlier capture phases directly impact the quality of the final output. The first and most critical step in any application of this type is to get quality images for the OCR engine to process.
I’m not an avid mobile developer, but I know enough to build a strategy around the best ways to leverage an imaging API, and to hand my client a proper imaging implementation to build on top of. I decided to use Atalasoft’s mobile imaging SDK, an engine I became familiar with years back when building a .NET imaging application for invoice processing.
When it comes to capturing an image, there are thousands of combinations of levers to push and pull: file type, color depth, resolution, cropping, skew, noise removal, and so on. There are consumer-grade imaging APIs out there, but they lack experience in document imaging and don’t deliver enterprise-quality results.
It took me a while to get started with the SDK, but once I had the framework up and some basic operations working (take photo, process photo), it was fast. All of the Atalasoft magic happened when the user clicked the "Process Image" button.
Implementation
First I ran the kfxKEDQuickAnalysisFeedback method to get some basic information on image quality. Most important is isBlurry: if it is true, I force the user to capture another image, because dithering is a big enemy of OCR. The other aspects, such as saturation, are tolerable, because the bitonal conversion threshold usually handles them just fine. The worst case is inversion, which is still OCR-able.
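Here is a minimal sketch of that gate in Objective-C. kfxKEDQuickAnalysisFeedback and isBlurry come straight from the SDK; the surrounding method, and the assumption that the feedback object has already been delivered (in the SDK it arrives after the image processor runs its quick analysis), are mine.

```objc
// #import <kfxLibEngines/kfxEngines.h>  // header name may vary by SDK version

// Hypothetical gate, called once quick-analysis feedback is available for a
// captured image. Only kfxKEDQuickAnalysisFeedback / isBlurry are SDK names;
// the method itself is a sketch.
- (BOOL)shouldKeepImageWithFeedback:(kfxKEDQuickAnalysisFeedback *)feedback
{
    if (feedback.isBlurry) {
        // Blur is the one defect I reject outright: force a re-capture.
        return NO;
    }
    // Saturation problems are tolerable; the bitonal conversion threshold
    // usually handles them. Worst case is inversion, which is still OCR-able.
    return YES;
}
```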
With a good snap, I could then do further image processing to improve quality. For optimum OCR results the ideal settings are:
- Format: TIFF Group 4 - MIMETYPE_TIF
The top two OCR engines always process TIFF Group 4 images internally. If you don’t convert the image, the engine will. Controlling the conversion upfront, based on what you know about your images, produces better quality and faster processing, because the OCR engine doesn’t waste time doing the conversion itself. TIFF Group 4 is also lossless, which is just as critical.
- Color: B&W KEDOutputColor{ KED_BITDEPTH_BITONAL=1}
For an image that is originally color, the document is best identified by converting it to black and white. And because the processing results matter more than the image’s looks, and OCR works best on bitonal images, we err on the side of quality, not beauty.
- Resolution: 300 DPI
300 DPI is the best balance of quality and speed. Higher DPI is slightly more accurate, but it slows both image transfer and OCR, while anything below 300 DPI loses accuracy fast. On mobile, however, this setting is tricky because the phone’s camera dictates the original DPI, so the image has to be converted, and you can only do that after it’s cropped. Conversion is risky, but it can be done accurately on the device.
- Cropping: KEDCroppingOptions{KED_CROP_AUTO}
The more you can reduce the image and focus on the actual content, the better for both performance and accuracy. I was pleasantly surprised to find that Atalasoft’s auto crop is very accurate. Given the nature of mobile capture it would not be possible to specify the crop in advance, and too slow to have the user do it manually, so without good auto crop I could not have used cropping at all.
All the basic image settings can be applied with built-in classes and enumerators. I also used kfxKEDImagePerfectionProfile to reference an XML file with more advanced settings, which included the resolution change, noise removal, smoothing, and content-based deskew.
Because of the nature of the images, the order of operations is important:
color > format > cropping > resolution > cleanup in the advanced profile
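Pulling those pieces together, here is a rough Objective-C sketch of the configuration. The enumeration values are the ones named above; kfxKEDBasicSettingsProfile and the exact property names are my reading of the SDK documentation, and the XML file name is a stand-in, so verify them against your SDK version.

```objc
// Sketch: configure profiles for a captured kfxKEDImage before processing.
// Property names below are assumptions from the SDK docs, not verified API.
- (void)configureProfilesForImage:(kfxKEDImage *)capturedImage
{
    // Basic settings, in the order that matters: color, then crop, then DPI.
    kfxKEDBasicSettingsProfile *basic = [[kfxKEDBasicSettingsProfile alloc] init];
    basic.outputColor = KED_BITDEPTH_BITONAL;   // B&W for the OCR engine
    basic.doCrop      = KED_CROP_AUTO;          // auto-crop to the document
    basic.outputDPI   = 300;                    // rescale after cropping

    // Output format is set on the image itself (TIFF Group 4, lossless).
    capturedImage.imageMimeType = MIMETYPE_TIF;

    // Advanced cleanup (noise removal, smoothing, content-based deskew)
    // comes from an XML operations file referenced by a perfection profile.
    // "cleanup-ops.xml" is a hypothetical file name.
    NSString *opsPath = [[NSBundle mainBundle] pathForResource:@"cleanup-ops"
                                                        ofType:@"xml"];
    kfxKEDImagePerfectionProfile *advanced =
        [[kfxKEDImagePerfectionProfile alloc] init];
    advanced.ipOperationsFilePath = opsPath;    // property name per the docs
}
```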
My use of the API was focused on image processing. The SDK also has its own capture controller, but I decided not to use it, both in case the way the application captures images changes and to keep capture isolated from processing.
After all the good stuff, the image is sent to a cloud-based OCR server I have set up. OCR is done in the cloud because it is very CPU intensive, and I need heavy processing power to handle the volume in a reasonable time frame. My next step, however, is to implement barcode reading prior to the cloud-based OCR; in some cases that will avoid the need for OCR altogether.
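The upload itself is plain Foundation code. Here is a minimal sketch; the endpoint URL and the image/tiff POST contract are stand-ins for my server’s actual API, and I assume the processed image has already been serialized to TIFF data.

```objc
#import <Foundation/Foundation.h>

// Hypothetical uploader: POST the processed TIFF to the cloud OCR server.
// The URL and Content-Type contract below are illustrative only.
- (void)uploadTiffData:(NSData *)tiffData
{
    NSURL *url = [NSURL URLWithString:@"https://ocr.example.com/v1/documents"];
    NSMutableURLRequest *request = [NSMutableURLRequest requestWithURL:url];
    request.HTTPMethod = @"POST";
    [request setValue:@"image/tiff" forHTTPHeaderField:@"Content-Type"];

    NSURLSessionUploadTask *task = [[NSURLSession sharedSession]
        uploadTaskWithRequest:request
                     fromData:tiffData
            completionHandler:^(NSData *data, NSURLResponse *response, NSError *error) {
                if (error != nil) {
                    NSLog(@"OCR upload failed: %@", error.localizedDescription);
                    return;
                }
                // The server queues the document for OCR and NLP entity
                // extraction; organized results come back asynchronously.
                NSLog(@"OCR upload complete");
            }];
    [task resume];
}
```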
The MobileImage SDK supports this, and the accuracy is very good: it can read a barcode in pretty much any location, at almost any angle. I will also add image guides. One of the hardest things about capturing images on a mobile device is the angle of the camera; if it is not perfectly flat, it impacts cropping and the quality of the text. The skew feature in the SDK can help a little, but the best thing is simply to give the user a rectangular guide.
From the user's perspective, this is all they have to do. The results are processed by the OCR server, then stored and organized according to the entities (people, places, proper names, etc.) found by an NLP engine, and are then available for discovery.
The MobileImage SDK wins on high-quality output images. Where I think it could do better is not processing but ease of use and the architecture of the SDK:
- The API design (class names, structure, enumerators, etc.) is funky. You would really have to be in the know (i.e., work for them) or spend a lot of time figuring out what each piece does. It is not a great naming convention or organization, especially for my style of coding, which is not to read documentation in advance but to use code completion to find what I need.
- I was hoping the documentation would help, but it ships only with the distribution of the API, so you have to launch it locally. I wish it had search across all elements, was hosted on their site, and had better code snippets.
- And finally, I love samples; the best way to learn is by seeing. But the one code sample I received crammed all the basic functionality into one application and had poor commenting. I find it better to see specific user flows in smaller applications.
But I understand that all of these are artifacts of the SDK essentially being a port of an existing, established client-server SDK, so it makes sense in that light. Also, Xcode is not so easy to use; in retrospect I should have used Xamarin as the IDE. When I originally started the project I wanted to try it in Swift, but I could not get it to work.
Anyway, my application is very simple so an iOS Objective-C project was fine. They also have support for Android and I suspect that usage is a bit easier there.
Once I got going it was not all that hard, and the most important part was solving the garbage-in problem. After testing on 30 images I could show that my end-to-end accuracy, through OCR and final organization, was around 90%. The 10% loss did not come from image processing, which had only a 5% image rejection rate, but from OCR accuracy and, above all, NLP, which is the least accurate of these technologies.
I would recommend the MobileImage SDK over the simpler imaging SDKs for any serious image processing, and for any application that is not purely consumer-oriented and requires consistency and robust image support.