Extract User Data Fields From Fillable PDF Document

Uzi Granot

19 Jun 2019, CPOL

An application that extracts user data from a fillable PDF document. It is based on the PdfFileAnalyzer project (version 2.1).

Introduction

A PDF document can contain a collection of fields for gathering information from the user. This project allows you to extract the data stored in these fields. It depends on the PDF File Analyzer With C# Parsing Classes (Version 2.1). The software implements Section 8.6, Interactive Forms, of “PDF Reference, Sixth Edition, Adobe Portable Document Format Version 1.7, November 2006”.

To extract the data fields, you supply a PDF file name. If the file is encrypted, you also need to supply the password. The library opens the file and reads its main structure. Next, it reads the interactive data fields. The result is an array of fields containing the field names and the user-entered data. You can serialize this array to an XML file.

Executing the PDFExtractFormData Demo Program

Start the program. Press the Open PDF File button. Use the Open file dialog to open a PDF file containing interactive data fields. The demo program will display the number of pages in your document, the number of indirect objects, the number of interactive data fields and the number of digital signatures. Press Save Form Data and the program will save the form data to an XML file with the same name as your PDF. The XML file will then be displayed in Notepad.

Integrating the Software to Your Application

Save the PdfFileAnalyzer.dll library in your development area and add a reference to it in your project. Add the three source files included in the distribution (PdfGetFormData.cs, PdfFieldData.cs and PdfFormData.cs) to your project. Then add using directives for the PdfExtractFormData and PdfFileAnalyzer namespaces to your source.

Create a PDF form data reader. The PdfGetFormData class is derived from the PdfReader class of PdfFileAnalyzer.

PdfGetFormData GetFormData = new PdfGetFormData();

Open the PDF file.

// open without a password
bool result = GetFormData.OpenPdfFile(PdfFileName);

// or, open with a password
bool result = GetFormData.OpenPdfFile(PdfFileName, Password);

All errors except encryption errors throw an exception. The returned result is true if the file was opened successfully. If the file is password protected and the password was not given or was wrong, the result is false. In that case, examine the DecryptionStatus property. If it is InvalidPassword and you know the password, you can supply it with the TestPassword method. See PdfExtractFormData.cs for a full example.

// set the password
bool Result = GetFormData.TestPassword(Password);

After the file is opened successfully, get the interactive data fields.

// get form data
PdfFieldData[] FieldDataArray = GetFormData.GetFields();

If FieldDataArray is null, the PDF file does not have interactive data fields.
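One way to serialize the resulting array to XML is the standard .NET XmlSerializer. The following is a minimal sketch, not the demo program's own code. It assumes the PdfFieldData layout shown later in this article (a local copy is declared so the example stands alone). FormDataXml and Write are hypothetical names.

```csharp
using System.IO;
using System.Xml.Serialization;

// Local copy of the PdfFieldData layout described in this article.
public class PdfFieldData
{
    public int Index;
    public int Parent;
    public string Type;
    public string Name;
    public string AltName;
    public string Value;
}

// Hypothetical helper, not part of PdfFileAnalyzer or PdfGetFormData.
public static class FormDataXml
{
    // Writes the field array as XML to the given writer.
    // A null array (no interactive fields) is written as an empty list.
    public static void Write(PdfFieldData[] fields, TextWriter output)
    {
        if (fields == null)
        {
            fields = new PdfFieldData[0];
        }

        // XmlSerializer handles public fields, so no attributes are needed.
        XmlSerializer serializer = new XmlSerializer(typeof(PdfFieldData[]));
        serializer.Serialize(output, fields);
    }
}
```

To save to a file, pass a StreamWriter opened on the target path, for example one named after the PDF file with the extension ".xml".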

The PDF document form data is stored in an array of PdfFieldData elements. Within a PDF document, these fields are organized in a hierarchical structure. The Index field (zero-based) is the element's position in the PdfFieldData array. The Parent field is the index of the parent of the current field. A Parent value of -1 indicates a root field. You can navigate from any field back to the root.
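Navigating from a field back to the root can be sketched as a loop over the Parent links. The example below joins the names along the way with periods, which is how the PDF specification forms a fully qualified field name. It assumes that Name holds the partial name of each node. FieldPath and FullName are hypothetical names, and the class is a local copy of the layout described in this article.

```csharp
using System.Collections.Generic;

// Local copy of the PdfFieldData layout described in this article.
public class PdfFieldData
{
    public int Index;
    public int Parent;
    public string Type;
    public string Name;
    public string AltName;
    public string Value;
}

// Hypothetical helper, not part of PdfGetFormData.
public static class FieldPath
{
    // Builds a dotted name by following Parent links from the given
    // field up to its root (Parent == -1).
    public static string FullName(PdfFieldData[] fields, int index)
    {
        List<string> parts = new List<string>();

        // The step counter bounds the walk in case of malformed parent links.
        for (int i = index, steps = 0; i >= 0 && steps < fields.Length; i = fields[i].Parent, steps++)
        {
            string name = fields[i].Name;
            if (!string.IsNullOrEmpty(name))
            {
                parts.Add(name);
            }
        }

        // Names were collected leaf first, so reverse them to start at the root.
        parts.Reverse();
        return string.Join(".", parts);
    }
}
```

For a three-level tree with the names form, address and city, calling FieldPath.FullName on the city element returns "form.address.city".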

The remaining four members of PdfFieldData (Type, Name, AltName and Value) correspond to entries defined in the PDF specification, Table 8.69 on page 675. If the field Type (Key=FT) is blank, the element is not a data field; it is only part of the tree hierarchy. There are four types of data fields: Button (Btn), Text (Tx), Choice (Ch) and Signature (Sig). Each field has a name (Key=T), an alternate name (Key=TU) and a value (Key=V). The name and the alternate name are assigned by the PDF document creator. The value is entered by the user of the document. If the value is an empty string, the user did not enter a value. The values for buttons and choices are taken from a built-in list assigned by the document creator, and the user selects the value from that list. If a choice field has multi-choice capability, the selected choices are separated by end of line. Signature fields are handled differently from the other field types; that case is described below. The PdfFieldData class is defined as:

public class PdfFieldData
    {
    public int Index;       // field index into the form array
    public int Parent;      // parent data field
    public string Type;     // Btn, Tx, Ch, Sig or empty
    public string Name;     // field name
    public string AltName;  // field alternate name
    public string Value;    // field value or empty
    }
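Two small checks follow from the description above: skipping elements that only serve as tree nodes, and splitting a multi-choice value into its selections. The sketch below is an illustration under assumptions. The article does not state which end-of-line sequence separates the choices, so all three common ones are accepted. FieldValues, IsDataField and Selections are hypothetical names.

```csharp
using System;

// Hypothetical helpers, not part of PdfGetFormData.
public static class FieldValues
{
    // True when the element carries data (Btn, Tx, Ch or Sig) rather
    // than serving only as a node of the field hierarchy.
    public static bool IsDataField(string type)
    {
        return !string.IsNullOrEmpty(type);
    }

    // Splits the value of a choice field into its selected items.
    // Returns an empty array for other field types or an empty value.
    public static string[] Selections(string type, string value)
    {
        if (type != "Ch" || string.IsNullOrEmpty(value))
        {
            return new string[0];
        }

        // "\r\n" is listed first so it is matched as one separator.
        string[] separators = new[] { "\r\n", "\n", "\r" };
        return value.Split(separators, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

Both helpers take the Type and Value strings of a PdfFieldData element, for example FieldValues.Selections(Field.Type, Field.Value).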

If the field Type is Sig, the Value field is “Signature n”, where n is the index of the signature in a separate result array. If there is one signature, the value is “Signature 0” and the array has one element. The data associated with all the signature fields is stored in an array of PDF dictionaries.

// get signature array
PdfDictionary[] Signatures = GetFormData.SignatureArray;
if(Signatures == null) { /* there are no signatures */ }

// get the first signature information
PdfDictionary Signature = Signatures[0];
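To look up the dictionary that belongs to a particular signature field, the index n has to be recovered from its “Signature n” value. A minimal sketch, assuming the value has exactly that form (the prefix, one space, then decimal digits). SignatureValue and ParseIndex are hypothetical names and not part of PdfGetFormData.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not part of PdfGetFormData.
public static class SignatureValue
{
    private const string Prefix = "Signature ";

    // Returns n for a value of the form "Signature n",
    // or -1 if the value does not have that form.
    public static int ParseIndex(string value)
    {
        if (value == null || !value.StartsWith(Prefix, StringComparison.Ordinal))
        {
            return -1;
        }

        // NumberStyles.None accepts decimal digits only (no sign, no spaces).
        int n;
        bool ok = int.TryParse(value.Substring(Prefix.Length),
            NumberStyles.None, CultureInfo.InvariantCulture, out n);
        return ok ? n : -1;
    }
}
```

The returned index can then be used with the Signatures array once it has been checked against the array length.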

Digital signatures are described in the PDF specification document in Section 8.7 page 725. The signature dictionary is detailed in Table 8.102 on page 727. The PdfFileAnalyzer library allows you to extract any entry in this dictionary.

The “Save First Signature” button allows you to save a signature dictionary in text format. It is saved to a file with the same name as the PDF file but with the extension “.sig”. The conversion of the signature dictionary to text is done by:

string Text = GetFormData.SigDictionary(index);

PdfGetFormData provides one method as an example of how to extract data from the signature dictionary. The example shows how to extract the encrypted byte array of the signature.

byte[] Contents = GetFormData.SigContents(index);

History

2019/06/17: Version 1.0, Original revision

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)