
Implementing a Standardized PDF/A Document Storage System with LEADTOOLS

28 Feb 2014
This white paper explores how to take full advantage of PDF/A as your universal document storage format by using the state-of-the-art technology in the LEADTOOLS Document Imaging SDKs.

This article is in the Product Showcase section for our sponsors at CodeProject. These articles are intended to provide you with information on products and services that we consider useful and of value to developers.

Introduction

Electronic document archival has evolved far beyond the simple days of scanning a paper document and saving it as an image or PDF. Nowadays, many documents never exist in physical form and may be stored in any of numerous open or proprietary formats. Adding to the disparity caused by varying file formats is the question of how and where files are stored. Many enterprises have their documents spread across numerous "data islands," including local computers, networked file shares and cloud services. Finally, the prevalence of mobile devices and tablets, which may or may not support some formats, further reinforces the need for standardized document archival.

Companies run on information, and as digital archives grow in both scale and diversity, the ability to find data within them efficiently and accurately often fails to keep up. PDF/A is designed for long-term archival and can carry searchable text, but migrating all of your various file formats remains a challenge, since raster image formats such as TIFF and JPEG offer little to no searchable content beyond the file name. This white paper explores how to take full advantage of PDF/A as your universal document storage format by using the state-of-the-art technology in the LEADTOOLS Document Imaging SDKs.

Creating a Searchable Document Archive with PDF/A

For years, PDF has been widely recognized and adopted as the best format for document archival, content management, record retention, risk management, litigation and discovery. This is especially true of PDF/A, a subset of PDF designed specifically for archival and future-proofing. A PDF/A file is completely self-contained: it stores fonts, color management information, annotations, images and more within the file itself. This ensures the document keeps its appearance for years to come while operating systems, devices, monitors and default fonts change all around it.

Normalizing your archive will yield many benefits in storage allocation, productivity and costs. The problem of being able to find and view your documents is drastically reduced since PDF is such a widely supported format. Making the choice to use PDF/A as your sole document archival format is certainly wise, but only solves a small part of the overall problem. Yet to be addressed are the issues of converting a divergent archive and ensuring that all further storage is done in a uniform fashion.

A handful of applications and scanners can save as PDF natively, but buying them for that purpose alone can be unnecessary and cost-prohibitive. In addition, documents come from many sources both inside and outside your organization, so at some level they must be processed and converted. Without a well-designed and automated process, the benefits of a normalized archive are hard to fully realize. Many organizations therefore shy away from going fully digital because of the challenges involved in properly converting and maintaining their newly envisioned document storage system. They feel trapped: they know they need to change, but do not know how to accomplish their goals in a holistic and cost-effective manner.

Making it All Possible with LEADTOOLS Document Imaging SDKs

If all or part of this situation sounds familiar, look no further than LEADTOOLS. Its Document Imaging SDKs cover the gamut of imaging technology needed to make a universal PDF/A document archive a reality.

Full PDF and PDF/A File Format Support

LEADTOOLS provides full control over the PDF format including advanced capabilities such as extracting text, hyperlinks, bookmarks and metadata as well as updating, splitting and merging pages from existing PDF documents. With LEAD Technologies' decades of expertise in image compression, its PDF SDK also offers the industry's best performing and most diverse PDF compression options including JBIG, JPEG2000 and Mixed Raster Content. Also included are features often difficult to find in similar commercial SDKs, including reading, displaying, editing and writing native PDF annotations and markup that work seamlessly with Adobe Acrobat and other compliant PDF viewers.

Rather than being at the mercy of the PDF file format and the often exorbitant costs of PDF editing capabilities, LEADTOOLS will open up incredible opportunities for your archival system and keep all the decision making and customization in your court.

Optical Character Recognition (OCR) and Conversion

LEADTOOLS comfortably tackles the problem of migrating an existing archive with mixed file formats to a unified PDF/A archive. With the ability to load, save and convert over 150 raster, vector and document file formats, you can rest assured that you will have your bases covered.

Since not all formats are text-based and searchable, LEADTOOLS can use its fast and highly accurate Optical Character Recognition technology to convert those images to searchable PDF/A. The advanced OCR SDK in LEADTOOLS supports over forty languages and character sets including English, Spanish, French, German, Japanese, Chinese, Arabic and more, making it a reliable solution for the largest of enterprises running and providing services in multiple countries across the globe.

Most text-based PDF files are also smaller than the original raster images from which they were converted. Moreover, all of this can be done in as few as three lines of code:

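// Create an instance of the LEADTOOLS Advantage OCR engine
IOcrEngine ocrEngine = OcrEngineManager.CreateEngine(OcrEngineType.Advantage, false);
// Start the engine with default settings
ocrEngine.Startup(null, null, null, null);
// Recognize the input image and save the result as a PDF document
ocrEngine.AutoRecognizeManager.Run(_strInputFile, _strOutputFile, DocumentFormat.Pdf, null, null);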
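
For a mixed archive, the conversion call above would typically sit inside a loop that first decides how each file should be handled. The following is a minimal sketch of that routing step only. `MigrationPlanner`, `ConversionRoute` and the extension list are hypothetical names and assumptions made for illustration; they are not part of LEADTOOLS, and a real migration would likely inspect file contents rather than trust extensions.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical routing categories for files found in a mixed archive.
public enum ConversionRoute
{
    Ocr,              // raster image with no text layer: needs OCR
    DirectConversion, // other document formats: convert without OCR
    ExistingPdf       // already a PDF: may still need checking or conversion to PDF/A
}

// Hypothetical helper, not part of LEADTOOLS.
public static class MigrationPlanner
{
    // Assumed (incomplete) list of raster extensions; adjust for your archive.
    private static readonly HashSet<string> RasterExtensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            ".tif", ".tiff", ".jpg", ".jpeg", ".png", ".bmp", ".gif"
        };

    // Decide how a file should be routed, based only on its extension.
    public static ConversionRoute Classify(string path)
    {
        if (string.IsNullOrEmpty(path))
            throw new ArgumentException("A file path is required.", "path");

        string extension = Path.GetExtension(path);
        if (RasterExtensions.Contains(extension))
            return ConversionRoute.Ocr;
        if (string.Equals(extension, ".pdf", StringComparison.OrdinalIgnoreCase))
            return ConversionRoute.ExistingPdf;
        return ConversionRoute.DirectConversion;
    }

    // Build the output path. The original extension is kept in the name
    // (scan.tif becomes scan.tif.pdf) so that scan.tif and scan.jpg do not
    // overwrite each other in the output folder.
    public static string OutputPathFor(string inputPath, string outputFolder)
    {
        return Path.Combine(outputFolder, Path.GetFileName(inputPath) + ".pdf");
    }
}
```

In this sketch, files classified as `ConversionRoute.Ocr` would be passed to `AutoRecognizeManager.Run` with the path returned by `OutputPathFor`, while the other routes would go to whatever conversion or validation step your workflow uses.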

Virtual Printing

If there is anything that the vast majority of applications have in common, it is the ability to print. This is, after all, where the need for document archival started. Instead of printing documents to paper and then later using scanners and OCR to convert them back into a searchable digital medium, the LEADTOOLS Virtual Printer can get it done right from the start.

This approach not only handles the documents you would normally print, but also lets you archive many other sources of information, including emails, faxes, websites, social media and virtually any file format. As an added benefit, the vast majority of the documents and materials you print are textual, which means the resulting PDFs are already searchable, require no special processing and are 100% accurate to the original document.

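// Fields shared by the event handlers below.
// _codec and _pdfFileName are assumed to be initialized elsewhere.
DocumentWriter _documentWriter;
RasterCodecs _codec;
string _pdfFileName;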

public void _printer_EmfEvent(object sender, EmfEventArgs e)
{
   // Create a new document page and pass the EMF in e.Stream
   DocumentPage documentPage = DocumentPage.Empty;
   documentPage.EmfHandle = new Metafile(e.Stream).GetHenhmetafile();

   // Load EMF as raster for image over text
   e.Stream.Position = 0;
   documentPage.Image = _codec.Load(e.Stream);

   // Add the page
   _documentWriter.AddPage(documentPage);
}

public void _printer_JobEvent(object sender, JobEventArgs e)
{
   if (e.JobEventState == EventState.JobStart)
   {
      // Initialize DocumentWriter
      PdfDocumentOptions pdfOptions = new PdfDocumentOptions();
      pdfOptions.DocumentType = PdfDocumentType.PdfA;
      pdfOptions.FontEmbedMode = DocumentFontEmbedMode.Auto;
      pdfOptions.ImageOverText = true;

      _documentWriter = new DocumentWriter();
      _documentWriter.SetOptions(DocumentFormat.Pdf, pdfOptions);
      _documentWriter.BeginDocument(_pdfFileName, DocumentFormat.Pdf);
   }
   else if (e.JobEventState == EventState.JobEnd)
   {
      // Add fonts (AddAndInstallFonts is a helper not shown here) and end the document
      AddAndInstallFonts(e.JobID);
      _documentWriter.EndDocument();

      // Open the finished PDF in the default viewer
      System.Diagnostics.Process.Start(_pdfFileName);
   }
}

Finally, LEADTOOLS Virtual Printers can also be configured to run on a server and made accessible over your company's LAN or the web with Internet Printing Protocol (IPP). This flexibility makes Virtual Printing an excellent solution for maintaining your archive into the future by providing a large funnel into which nearly any piece of information can be printed and then automatically archived through a central business workflow process.
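
When a shared printer feeds a central archive, each print job needs its own output file. The handlers above write to a single `_pdfFileName`, so one way to avoid collisions is to derive the path from the job ID and a timestamp. The sketch below is an illustration under stated assumptions: `ArchiveNaming` and `BuildArchivePath` are hypothetical names, the folder layout is arbitrary, and the job ID is assumed to be an integer.

```csharp
using System;
using System.Globalization;
using System.IO;

// Hypothetical helper, not part of LEADTOOLS.
public static class ArchiveNaming
{
    // Builds a path of the form <root>/<yyyy-MM-dd>/job-<id>-<HHmmss>.pdf.
    // The invariant culture keeps names identical regardless of server locale.
    public static string BuildArchivePath(string archiveRoot, int jobId, DateTime timestampUtc)
    {
        string dayFolder = timestampUtc.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture);
        string fileName = string.Format(
            CultureInfo.InvariantCulture,
            "job-{0:D6}-{1:HHmmss}.pdf",
            jobId,
            timestampUtc);

        // Nested Combine calls keep this compatible with older .NET versions.
        return Path.Combine(Path.Combine(archiveRoot, dayFolder), fileName);
    }
}
```

Under these assumptions, the `JobStart` branch could assign `_pdfFileName = ArchiveNaming.BuildArchivePath(archiveRoot, e.JobID, DateTime.UtcNow);` before calling `BeginDocument`, where `archiveRoot` is a folder of your choosing. The day folder must exist before the document is written, for example by calling `Directory.CreateDirectory` on it first.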

HTML5 Zero Footprint Viewer

Just because you are saving your documents as PDF doesn't mean you can't benefit from a viewer. Although PDF is so widely adopted that few people consider the possibility of someone being unable to open it, plug-ins and viewing applications are still required in most situations. By using the HTML5 and JavaScript based viewer in LEADTOOLS, you can build a true cloud-based image viewing solution which requires no plug-ins or downloads. All of the heavy image processing and display is done on the client side, yielding fast display times and a responsive user interface.

Conclusion

With LEADTOOLS, standardizing your document storage to PDF/A is no longer an arduous, complex and costly endeavor. Everything you need to convert your existing files, manage and normalize your PDFs, and create all-inclusive business workflows is included in programmer-friendly libraries for multiple platforms. You can rest easy knowing that all the information your company relies on for efficient and productive operation will be properly archived and readily accessible.

Download the Full PDF/A, OCR and Virtual Printing Example

You can download the fully functional demo, which includes the features discussed above. To run this example you will need the following:

  • LEADTOOLS free 60-day evaluation
  • Visual Studio 2008 or later

Browse to the LEADTOOLS Examples folder (e.g. C:\LEADTOOLS 18\Examples\), where you can find example projects for this and many more technologies in LEADTOOLS.

Support

Need help getting this sample up and going? Contact our support team for free technical support! For pricing or licensing questions, you can contact our sales team (sales@leadtools.com) or call us at 704-332-5532.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
