Introduction
Document imaging is certainly saving trees and physical
storage space, but in some situations it fails to save much time or hassle.
Opting in to paperless statements or manually scanning paper documents yourself
is a great way to archive all of your bills, invoices, financial statements and
the like. However, it still requires a fair amount of time and energy to
thoughtfully organize the documents in your digital filing cabinet. After all,
what good is it to digitally archive your documents if they become nearly
impossible to find again in the future?
That scenario might not be too overwhelming for an
individual with a good memory and good habits, but what about medium to large
businesses and corporations that deal with thousands of documents on a daily
basis and have hundreds of people working with the same digital archive?
Without some kind of automation, you have a huge overhead of man-hours and,
even more problematic, ample opportunity for human error.
Imagine being able to drop all of your scanned documents
into a single folder and have all the work of moving and renaming the files in
a logical, consistent manner done automatically. LEADTOOLS
Forms Recognition and Processing fits the bill perfectly with its
high-level, flexible and powerful imaging libraries. Applications built with
LEADTOOLS can compare a scanned document against known templates and correctly
classify the document type. After a document is correctly identified,
LEADTOOLS can then extract OCR text, OMR marks, barcodes and more from defined
locations on the form.
Processing the Document Repository
The first step in solving this problem is to set up and
manage a central location where all of the scanned documents are placed for
classification. There are multiple ways of accomplishing this, such as using a
web service, a Windows service, or even a monkey pressing a button. The method
chosen in this example is a simple console application that is scheduled
to run with the Windows Task Scheduler.
The code that manages the repository is relatively simple
since it primarily uses basic file and folder operations from the System.IO
namespace. However, the most crucial part of the application is passed on to
the DocumentClassifier, which encapsulates the LEADTOOLS Forms Recognition
features and returns the data used for moving and renaming the documents.
using System;
using System.IO;

// Files waiting in the "New" folder, and the classifier loaded with the master forms
string[] newDocuments = Directory.GetFiles(docRepositoryNewDocs);
DocumentClassifier docClassifier = new DocumentClassifier(docRepositoryMasterForms);
string movedDocumentName, masterFormSubFolder;
foreach (string currentDoc in newDocuments)
{
    movedDocumentName = null;
    ClassifiedDocument classifiedDoc = docClassifier.ClassifyDocument(currentDoc);
    if (classifiedDoc.MasterFormName != null)
    {
        // Matched a master form: file it in a subfolder named after the form
        masterFormSubFolder = Path.Combine(docRepositoryRoot, classifiedDoc.MasterFormName);
        if (!Directory.Exists(masterFormSubFolder))
            Directory.CreateDirectory(masterFormSubFolder);
        if (classifiedDoc.DocumentDate != DateTime.MinValue)
        {
            // A date was extracted: rename to yyyyMMdd and keep the original extension
            movedDocumentName = Path.Combine(masterFormSubFolder,
                classifiedDoc.DocumentDate.ToString("yyyyMMdd") + Path.GetExtension(currentDoc));
        }
        else
        {
            // No date was extracted: keep the original file name
            movedDocumentName = Path.Combine(masterFormSubFolder, Path.GetFileName(currentDoc));
        }
    }
    else
    {
        // No master form matched: park the file in the unclassified folder
        movedDocumentName = Path.Combine(docRepositoryUnclassifiedDocs,
            Path.GetFileName(currentDoc));
    }
    // Note: File.Move throws an IOException if the destination already exists
    if (!string.IsNullOrEmpty(movedDocumentName))
        File.Move(currentDoc, movedDocumentName);
}
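One gap in the loop above is that two invoices with the same master form and
the same date produce the same destination name, and File.Move throws an
IOException when the destination already exists. The following is a minimal
sketch of one way to handle that, assuming a numeric suffix is acceptable.
DestinationNaming and GetUniquePath are hypothetical names used for
illustration; they are not part of the downloadable demo or of LEADTOOLS.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of the demo or of LEADTOOLS.
public static class DestinationNaming
{
    // Returns a path inside targetFolder that does not exist yet.
    // The first choice is baseName + extension; if that is taken,
    // _1, _2, ... is appended to the base name until a free name is found.
    public static string GetUniquePath(string targetFolder, string baseName, string extension)
    {
        string candidate = Path.Combine(targetFolder, baseName + extension);
        for (int suffix = 1; File.Exists(candidate); suffix++)
        {
            candidate = Path.Combine(targetFolder, baseName + "_" + suffix + extension);
        }
        return candidate;
    }
}
```

In the dated branch of the loop, the destination could then be built with
DestinationNaming.GetUniquePath(masterFormSubFolder,
classifiedDoc.DocumentDate.ToString("yyyyMMdd"), Path.GetExtension(currentDoc)).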
Using LEADTOOLS Forms Recognition
Before LEADTOOLS can start classifying documents, it must
know how to classify them, which is accomplished by creating a collection of
Master Form templates. LEADTOOLS ships with a Master Form editor demo, which we
will use to add master forms for two different invoices. Each master form
contains a single OCR field that extracts the invoice date, which can then be
used to rename the file.
Figure 1: Defining Master Form Templates with the Master Form Editor
Now that our master forms are defined, we are ready to
process the documents. We have scanned two filled-out invoices based on the
master forms, and a tax form that does not have a known template. For each
file in the "New" folder, LEADTOOLS will compare it against the master
templates. If a match is found, it will then process the document’s fields and
return the form’s name and the date field.
// Start one OCR engine per processor core
ocrEngines = new List<IOcrEngine>();
for (int i = 0; i < Environment.ProcessorCount; i++)
{
    ocrEngines.Add(OcrEngineManager.CreateEngine(OcrEngineType.Advantage, false));
    ocrEngines[i].Startup(formsCodec, null, String.Empty, String.Empty);
}
// Load the master forms from disk and create the recognition engine
formsRepository = new DiskMasterFormsRepository(formsCodec, _MasterFormFolder);
autoEngine = new AutoFormsEngine(formsRepository, ocrEngines, null,
    AutoFormsRecognitionManager.Default | AutoFormsRecognitionManager.Ocr, 30, 70, true);

// Classify one document and read its fields
AutoFormsRunResult runResult = autoEngine.Run(document, null);
if (runResult != null)
{
    retClassifiedDocument.MasterFormName = runResult.RecognitionResult.MasterForm.Name;
    foreach (FormPage formPage in runResult.FormFields)
    {
        foreach (FormField field in formPage)
        {
            if (field != null && field.Name == "ClassificationRenameDate")
            {
                // Guard against a non-text result or OCR text that is not a valid date
                TextFormFieldResult textResult = field.Result as TextFormFieldResult;
                DateTime documentDate;
                if (textResult != null && DateTime.TryParse(textResult.Text, out documentDate))
                    retClassifiedDocument.DocumentDate = documentDate;
            }
        }
    }
}
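Because the date field arrives as OCR text, it may not always parse as a date.
The sketch below shows one possible way to parse it defensively with standard
.NET calls only, falling back to DateTime.MinValue, which the repository loop
already treats as "no date available". OcrDateParser, ParseOrMinValue and the
list of formats are assumptions made for illustration; the formats should be
adjusted to whatever your invoices actually print.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not part of the demo or of LEADTOOLS.
public static class OcrDateParser
{
    // Assumed date formats; replace with the formats used on your own forms.
    private static readonly string[] AssumedFormats =
    {
        "M/d/yyyy", "yyyy-MM-dd", "MMMM d, yyyy"
    };

    // Returns the parsed date, or DateTime.MinValue when the text is empty
    // or does not match any of the assumed formats.
    public static DateTime ParseOrMinValue(string ocrText)
    {
        if (string.IsNullOrWhiteSpace(ocrText))
            return DateTime.MinValue;

        DateTime parsed;
        bool ok = DateTime.TryParseExact(
            ocrText.Trim(),
            AssumedFormats,
            CultureInfo.InvariantCulture,
            DateTimeStyles.AllowWhiteSpaces,
            out parsed);

        return ok ? parsed : DateTime.MinValue;
    }
}
```

Listing the expected formats explicitly avoids culture-dependent guesses, such
as reading 03/04/2012 as March 4 on one machine and April 3 on another.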
Figure 2: Forms Repository and Sub Folders Before Classification
Figure 3: Forms Repository and Sub Folders After Classification
As you can see, the two invoices were correctly matched to
their master form and renamed based on the date field. Additionally, the
unclassified documents folder acts as a fail-safe, letting the application grow
and adapt with minimal effort. When you have a new document type that is not
in your master forms collection, all you have to do is use one of those images
as a template, add the fields you want to extract, and move the unclassified
documents back into the New folder so they are processed again the next time
the application runs.
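Moving the unclassified documents back can itself be automated. The following
is a minimal sketch using only System.IO; UnclassifiedRequeue and MoveBackToNew
are hypothetical names, and the two folder paths are whatever your repository
uses for its unclassified and New folders.

```csharp
using System.IO;

// Hypothetical helper, not part of the demo or of LEADTOOLS.
public static class UnclassifiedRequeue
{
    // Moves every file from unclassifiedFolder back into newDocsFolder and
    // returns how many files were moved. A file whose name already exists
    // in the target folder is left where it is.
    public static int MoveBackToNew(string unclassifiedFolder, string newDocsFolder)
    {
        Directory.CreateDirectory(newDocsFolder); // no-op if it already exists
        int moved = 0;
        foreach (string source in Directory.GetFiles(unclassifiedFolder))
        {
            string target = Path.Combine(newDocsFolder, Path.GetFileName(source));
            if (File.Exists(target))
                continue;
            File.Move(source, target);
            moved++;
        }
        return moved;
    }
}
```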
Taking it a Step Further
This simple solution has massive potential for expansion and
adaptation. For example, you could easily manage your documents online by
connecting to a cloud service such as Google Docs, SkyDrive or iCloud.
Similarly, businesses could adapt it to monitor and organize incoming faxes and
email attachments, or store the recognized field data directly in a
database. Most importantly, LEADTOOLS Forms Recognition can process as much or
as little information from the scanned documents as you desire, stretching its
usefulness far beyond mere organization and archival. Form fields, check
boxes, invoice amounts, and much more can be extracted to speed up any
workflow.
Download the Full Forms Recognition Example
You can download the fully functional demo which includes
the features discussed above. To run this example you will need the following:
Support
Need help getting this sample up and going? Contact
our support team for free technical support! For pricing or licensing
questions, you can contact our sales team (sales@leadtools.com)
or call us at 704-332-5532.