
Integrating Data Capture for Invoices and Other Semi-Structured Forms

1 Dec 2014
This whitepaper describes the basic steps in deploying a .NET solution for automating accounts payable by using a development toolkit for semi-structured forms processing. This solution can be integrated into applications via API.

This article is in the Product Showcase section for our sponsors at CodeProject. These articles are intended to provide you with information on products and services that we consider useful and of value to developers.

Introduction

Recent research by IBM, Gartner and Merrill Lynch has concluded that roughly 80 percent of data in companies today is semi-structured or unstructured.

With the growing reliance on metrics, electronic decision support, fully automated back-office functions, and even tools for leveraging Big Data, this mountain of unstructured data represents an enormous gap in the IT landscape, a vast store of data that cannot be exploited without costly, error-prone manual intervention to capture and order the data for use by automated systems.

Not surprisingly, IT operations are under growing pressure to get a handle on unstructured data. But that effort has been thwarted by a lack of effective and affordable technology, and also by an outmoded business culture. Despite half a century of corporate adoption of computerized business processes, data is still being generated in formats that are difficult to capture accurately.

In retail, healthcare, financial services, transportation and other industries, semi-structured forms, such as received invoices and contracts on paper or in Microsoft Word or PDF format, remain a core frontline business tool. The proliferation of data originating in unstructured emails, letters and faxes daunts attempts at cost-effective, accurate capture.

Efforts to rein in unstructured data have also been hampered by the fact that the solution requires not one mature technology, but several working in concert. Data capture from semi-structured forms, for example, requires advanced recognition technology to scan and correct the form image, plus accurate optical character recognition (OCR) technology, plus intelligent character recognition (ICR) to capture handwriting and signatures, plus an intelligent processing component to correctly map recognized data to database fields for use by other applications. If any one of these components is insufficiently fast and accurate, or can’t be easily integrated with the others, the entire data capture development project fails.

A new generation of recognition technologies has recently become available to help developers begin to address the missing 80 percent of corporate data. These solutions incorporate mature implementations of all of the components needed for capture, and make their functions easy to integrate into applications through APIs.

As an example, this whitepaper describes the basic steps in setting up automated invoice capture, which can then be integrated into an application via API. The SDK used can be downloaded for evaluation from www.accusoft.com/formsinvoicedownload.htm.

Basic Capture Concepts

The example software development kit (SDK) used in this whitepaper is FormSuite for Invoices from Accusoft. An evolution of Accusoft’s mature solutions for forms processing and cleanup, OCR and ICR, FormSuite for Invoices is designed to process a scanned image (or Word or PDF file) of an invoice, and then search for and extract data such as Invoice Number, Invoice Date, Total Due and line item detail from the image by organizing the data into label-data coupled pairs called FormFields.

The label in a FormField is the text on the document that identifies the data. Wordings such as "Invoice Number", "Date of Invoice", and "Invoice #" are examples of FormField labels. To be more exact, these are "Label Aliases" for a FormField label. For example, "Invoice Total Amount Due" could be the name of a label, but on some documents the Total Amount Due may be identified by different wording. To account for this variation, each label may be assigned any number of aliases; for "Invoice Total Amount Due", aliases could be "Pay this Amount", "Amount Due", "Total Due", and "Total".

The data in a FormField is the text on the document that FormSuite for Invoices associates with the label, such as "$100.00", "1/23/2014", or "Jan. 12, 2013".
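The relationship between a label and its aliases can be sketched as a small data structure. This is an illustration only: `FieldDefinition` and `FieldLookup` are hypothetical names, not part of the FormSuite for Invoices API, and the trimmed, case-insensitive comparison is an assumption standing in for whatever matching the SDK actually performs.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical types for illustration; not the FormSuite for Invoices API.
public sealed class FieldDefinition
{
    public string Label { get; }
    public IReadOnlyList<string> Aliases { get; }

    public FieldDefinition(string label, params string[] aliases)
    {
        Label = label;
        Aliases = aliases;
    }

    // True when the text found on the page is the label itself or one of its aliases.
    // Assumption: a trimmed, case-insensitive exact comparison is good enough here.
    public bool Matches(string textOnPage)
    {
        string candidate = textOnPage.Trim();
        return new[] { Label }.Concat(Aliases)
            .Any(name => string.Equals(name, candidate, StringComparison.OrdinalIgnoreCase));
    }
}

public static class FieldLookup
{
    // Returns the first definition whose label or alias matches, or null if none does.
    public static FieldDefinition Find(IEnumerable<FieldDefinition> definitions, string textOnPage)
    {
        return definitions.FirstOrDefault(d => d.Matches(textOnPage));
    }
}
```

With a definition such as `new FieldDefinition("Invoice Total Amount Due", "Pay this Amount", "Amount Due", "Total Due", "Total")`, the page text "Amount Due" resolves to the "Invoice Total Amount Due" field.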

FormSuite for Invoices also identifies and extracts line item data from tables found in the image of the invoice, and builds FormTables from that data. A FormTable is composed of a set of column headers and a set of row data lists.
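As a rough picture of "a set of column headers and a set of row data lists", the sketch below models a line item table in plain C#. `LineItemTable` is a hypothetical stand-in written for this illustration; the SDK's own FormTable class may expose its data quite differently.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a table of line items; not the SDK's FormTable class.
public sealed class LineItemTable
{
    public List<string> ColumnHeaders { get; } = new List<string>();
    public List<List<string>> Rows { get; } = new List<List<string>>();

    // Adds one row; the row must have exactly one cell per column header.
    public void AddRow(params string[] cells)
    {
        if (cells.Length != ColumnHeaders.Count)
            throw new ArgumentException("Row length must match the number of column headers.");
        Rows.Add(new List<string>(cells));
    }

    // Looks up a cell by row index and column header text (case-insensitive).
    public string Cell(int rowIndex, string header)
    {
        int column = ColumnHeaders.FindIndex(
            h => string.Equals(h, header, StringComparison.OrdinalIgnoreCase));
        if (column < 0)
            throw new KeyNotFoundException("No column named '" + header + "'.");
        return Rows[rowIndex][column];
    }
}
```

Keeping the headers separate from the rows lets a caller address a cell by column name even when different vendors order their columns differently.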

During processing, a vendor is matched with the input invoice. This is accomplished by matching the vendor’s phone numbers and address found on the images with those posted in a list of vendors.
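The article does not say how the SDK weighs phone numbers against addresses, so the following is only one plausible way to sketch the idea. `VendorRecord` and `VendorMatcher` are hypothetical names, and the scoring rule (one point per matching phone number, one for a matching address) is an assumption made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical vendor record; the SDK's IVendor interface may look quite different.
public sealed class VendorRecord
{
    public string Name { get; set; }
    public List<string> PhoneNumbers { get; set; } = new List<string>();
    public string Address { get; set; }
}

public static class VendorMatcher
{
    // Keeps digits only, so "(813) 555-0100" and "813.555.0100" compare equal.
    public static string NormalizePhone(string phone)
    {
        return new string(phone.Where(c => char.IsDigit(c)).ToArray());
    }

    // Upper-cases and collapses whitespace so minor spacing differences do not matter.
    public static string NormalizeAddress(string address)
    {
        return string.Join(" ", address.ToUpperInvariant()
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries));
    }

    // Assumed scoring: one point per matching phone number, one for a matching address.
    // Returns the best-scoring vendor, or null when nothing matches.
    public static VendorRecord Match(IEnumerable<VendorRecord> vendors,
                                     IEnumerable<string> phonesOnPage,
                                     string addressOnPage)
    {
        var pagePhones = new HashSet<string>(phonesOnPage.Select(NormalizePhone));
        string pageAddress = NormalizeAddress(addressOnPage);

        return vendors
            .Select(v => new
            {
                Vendor = v,
                Score = v.PhoneNumbers.Count(p => pagePhones.Contains(NormalizePhone(p)))
                        + (NormalizeAddress(v.Address) == pageAddress ? 1 : 0)
            })
            .Where(x => x.Score > 0)
            .OrderByDescending(x => x.Score)
            .Select(x => x.Vendor)
            .FirstOrDefault();
    }
}
```

Normalizing both sides before comparing matters because recognized text rarely reproduces punctuation and spacing exactly as it appears in a vendor list.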

Once an invoice has been processed and verified, a template of that invoice is created and saved. This template will be used in future runs for other invoices from the same vendor, to improve the accuracy of the field matching. Each time the template is used it is updated with new location and size data calculated from the old location and size data and from the new information found and verified for all the fields identified on the document.
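The article does not give the formula used to combine old and new location data. As a minimal sketch of the general idea, the example below folds each newly verified field region into a stored one with an incremental mean. `FieldRegion` and `TemplateField` are hypothetical types, and both the averaging rule and the pixel units are assumptions, not the SDK's documented behavior.

```csharp
using System;

// Hypothetical field region stored in a template; units are assumed to be pixels.
public struct FieldRegion
{
    public double X, Y, Width, Height;

    public FieldRegion(double x, double y, double width, double height)
    {
        X = x; Y = y; Width = width; Height = height;
    }
}

public sealed class TemplateField
{
    public FieldRegion Region { get; private set; }
    public int Observations { get; private set; }

    public TemplateField(FieldRegion first)
    {
        Region = first;
        Observations = 1;
    }

    // Folds a newly verified region into the stored one using an incremental mean:
    // updated = old + (observed - old) / n, where n counts all observations so far.
    public void Update(FieldRegion observed)
    {
        Observations++;
        double n = Observations;
        Region = new FieldRegion(
            Region.X + (observed.X - Region.X) / n,
            Region.Y + (observed.Y - Region.Y) / n,
            Region.Width + (observed.Width - Region.Width) / n,
            Region.Height + (observed.Height - Region.Height) / n);
    }
}
```

Under this rule, a field first seen at X = 100 and then verified at X = 110 is stored at X = 105, so the template drifts toward where the field usually appears instead of jumping to the latest position.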

Deployment Steps

After performing typical initial .NET steps for installation, distribution and licensing, deployment of FormSuite for Invoices requires some simple coding to:

  • Load & Process Invoices
  • Validate Results
  • Update Templates

To load and process invoices, a list of image identifiers must be passed to the processor component. The identifiers are full paths to image files. A few other base identifiers, such as CompanyData, FormDefinition, and a Vendor List, also must be provided to the invoice processor before any processing can be performed upon the images.
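Because the identifiers are plain full paths, the list can be assembled with standard .NET file APIs before the SDK is involved at all. The helper below is a sketch: `InvoiceFileList` is a hypothetical name, and the set of file extensions is an assumption that should be adjusted to the formats a given deployment actually accepts.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class InvoiceFileList
{
    // Assumed set of extensions; adjust to the formats your deployment accepts.
    private static readonly HashSet<string> Extensions = new HashSet<string>(
        new[] { ".tif", ".tiff", ".png", ".jpg", ".pdf" }, StringComparer.OrdinalIgnoreCase);

    // Returns the full paths of candidate invoice files in a folder,
    // sorted so that repeated runs see the files in the same order.
    public static List<string> Collect(string folder)
    {
        return Directory.EnumerateFiles(folder)
            .Where(path => Extensions.Contains(Path.GetExtension(path)))
            .Select(path => Path.GetFullPath(path))
            .OrderBy(path => path, StringComparer.OrdinalIgnoreCase)
            .ToList();
    }
}
```

The resulting `List<string>` has the shape of the list of full paths that the processor expects as its image identifiers.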

The code sample below illustrates the basics of loading invoice images to the processor:

public void ProcessInvoices(List<string>         fileNames,
                            FormDefinition       formDefinition,
                            ICompanyData         companyData,
                            IEnumerable<IVendor> vendorList)
{
    using (Licensing license = new Licensing())
    {
        // Set up the license keys (placeholder values shown)
        license.SetSolutionName("YourSolutionName");
        license.SetSolutionKey(12345, 12345, 12345, 12345);
        license.SetOEMLicenseKey("2.0.AStringForOEMLicensingContactAccusoftSalesForMoreInformation…");

        // Create a processor for invoicing
        using (Processor invoiceProcessor = new Processor(license))
        {
            // Set the initial data required before any image can be processed
            invoiceProcessor.CompanyData = companyData;
            invoiceProcessor.VendorList = vendorList;
            invoiceProcessor.FormDefinition = formDefinition;

            // Run Analyze on the full list of invoice files
            AnalyzeResults results = invoiceProcessor.Analyze(fileNames);
        }
    }
}

The output data is collected into the FormResults list. Each FormResult contains the data found on all the pages of an invoice document.

Validation is a manual operation: the user must review the results on-screen and then accept, edit or reject them. Setting the "UserValidated" flag on Fields and Tables marks them as validated. Template processing also uses this flag to give greater preference to these "validated" values and locations.
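The accept, edit or reject flow can be sketched in application code as follows. Only the "UserValidated" flag name comes from the text above; `ReviewedField`, `ReviewDecision` and `Reviewer` are hypothetical types written for this illustration and are not the SDK's result classes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum ReviewDecision { Accept, Edit, Reject }

// Hypothetical result field; only the UserValidated flag name comes from the article.
public sealed class ReviewedField
{
    public string Name { get; set; }
    public string Value { get; set; }
    public bool UserValidated { get; set; }
}

public static class Reviewer
{
    // Applies a reviewer's decision to one field.
    // Accept keeps the value and Edit replaces it; both mark the field validated.
    // Reject clears the flag so the value gets no extra preference later.
    public static void Apply(ReviewedField field, ReviewDecision decision,
                             string correctedValue = null)
    {
        switch (decision)
        {
            case ReviewDecision.Accept:
                field.UserValidated = true;
                break;
            case ReviewDecision.Edit:
                if (correctedValue == null)
                    throw new ArgumentNullException(nameof(correctedValue));
                field.Value = correctedValue;
                field.UserValidated = true;
                break;
            case ReviewDecision.Reject:
                field.UserValidated = false;
                break;
        }
    }

    // Fields still waiting for a reviewer's confirmation.
    public static List<ReviewedField> Pending(IEnumerable<ReviewedField> fields)
    {
        return fields.Where(f => !f.UserValidated).ToList();
    }
}
```

A review screen could loop over `Reviewer.Pending(...)` until the list is empty, so that only confirmed values and locations feed into later template updates.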

The final step is updating templates. FormSuite for Invoices has the ability to create new templates; however, the developer’s application must manage and maintain all the templates. A simple interface component is included to provide all the tools required to allow FormSuite for Invoices and the developer’s application to pass templates back and forth. The following code example demonstrates where the template operation fits into the application’s processing scheme.

// Create an instance of the license class
using (Licensing license = new Licensing())
{
   license.SetSolutionName("YourSolutionName");
   license.SetSolutionKey(12345, 12345, 12345, 12345);
   license.SetOEMLicenseKey("2.0.AStringForOEMLicensingContactAccusoftSalesForMoreInformation…");

   // Update the templates for the form result
   TemplateIO.UpdatePageTemplates(templateProvider, formResult, companyData, vendor, true, license);
}

Once these basic steps are complete, FormSuite for Invoices is ready not only to process invoices but also to be put under programmatic control via the API, which is fully documented in the FormSuite for Invoices Help file.

Summary

FormSuite for Invoices represents only the first step in helping developers easily bring semi-structured and unstructured data into automated systems. In the future, Accusoft will be delivering SDKs for processing additional form types such as purchase orders, shipping documents, and health forms. In the meantime, developers can learn more about FormSuite for Invoices, and download the full SDK for trial, at www.accusoft.com/formsinvoice.htm.

About the Author

Prior to joining Accusoft, Ned Averill-Snell was a global best practices analyst for PricewaterhouseCoopers. A longtime computer journalist and author, he is a past contributing editor to DATAMATION and Inside Technology Training magazines and the author of two dozen books about IT.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.