Introduction
A PDF document can contain a collection of fields for gathering information from the user. This project allows you to extract the data stored in these fields. This project is dependent on the PDF File Analyzer With C# Parsing Classes (Version 2.1). The software implements Section 8.6 Interactive Forms of “PDF Reference, Sixth Edition, Adobe Portable Document Format Version 1.7 November 2006”.
To extract the data fields, you supply a PDF file name. If the file is encrypted, you also need to supply the password. The library opens the file and reads its main structure. Next, it reads the interactive data fields. The result is an array of fields containing the field names and the user-entered data. You can serialize this array to an XML file.
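As a minimal sketch of the XML serialization step, the standard .NET XmlSerializer can write and read an array of field records. The PdfFieldData class below is a local stand-in that mirrors the class listed later in this article, so the sketch compiles without the library. The FormDataXml class and its Save and Load methods are hypothetical names, not part of the distribution, and the demo program may serialize differently.

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

// Local stand-in mirroring the PdfFieldData class listed in this article.
public class PdfFieldData
{
    public int Index;
    public int Parent;
    public string Type;
    public string Name;
    public string AltName;
    public string Value;
}

// Hypothetical helper, not part of the distribution.
public static class FormDataXml
{
    // Write the field array to an XML file.
    public static void Save(PdfFieldData[] fields, string xmlFileName)
    {
        XmlSerializer serializer = new XmlSerializer(typeof(PdfFieldData[]));
        using (StreamWriter writer = new StreamWriter(xmlFileName))
        {
            serializer.Serialize(writer, fields);
        }
    }

    // Read the field array back from an XML file.
    public static PdfFieldData[] Load(string xmlFileName)
    {
        XmlSerializer serializer = new XmlSerializer(typeof(PdfFieldData[]));
        using (StreamReader reader = new StreamReader(xmlFileName))
        {
            return (PdfFieldData[]) serializer.Deserialize(reader);
        }
    }
}
```

To name the XML file after the PDF file, Path.ChangeExtension(PdfFileName, ".xml") produces the matching file name.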
Executing the PDFExtractFormData Demo Program
Start the program and press the Open PDF File button. Use the Open file dialog to select a PDF file containing interactive data fields. The demo program will display the number of pages in your document, the number of indirect objects, the number of interactive data fields and the number of digital signatures. Press Save Form Data and the program will save the form data to an XML file with the same name as your PDF. The XML file will then be displayed in Notepad.
Integrating the Software into Your Application
Save the PdfFileAnalyzer.dll library within your development area. Add the three source files included in the distribution, PdfGetFormData.cs, PdfFieldData.cs and PdfFormData.cs, to your project. Add the namespace PdfExtractFormData to your using list. Add a reference to PdfFileAnalyzer.dll. Add using PdfFileAnalyzer to your source.
Create a PDF form data reader. PdfGetFormData is a class derived from PdfReader in the PdfFileAnalyzer library.
PdfGetFormData GetFormData = new PdfGetFormData();
Open the PDF file.
// the file is not password protected
bool result = GetFormData.OpenPdfFile(PdfFileName);
// or, the file is password protected
bool result = GetFormData.OpenPdfFile(PdfFileName, Password);
All errors, except encryption errors, throw an exception. The returned result is true if the file was opened successfully. If the file is password protected and the password was not given or was wrong, the result is false. In that case, examine the DecryptionStatus property. If it is InvalidPassword and you know the password, you can provide it with the TestPassword method. Please review PdfExtractFormData.cs for a full example.
bool Result = GetFormData.TestPassword(Password);
After the file is successfully opened, get the interactive data fields.
PdfFieldData[] FieldDataArray = GetFormData.GetFields();
If FieldDataArray is null, the PDF file does not have interactive data fields.
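The open, password and field-reading steps above can be combined into one helper. This is a sketch only: it uses the calls named in this article (OpenPdfFile, DecryptionStatus, TestPassword, GetFields) and requires the PdfFileAnalyzer.dll reference and the three source files described earlier. It assumes that DecryptionStatus is an enumeration in the PdfFileAnalyzer namespace with an InvalidPassword member. The FormDataLoader class and its Load method are hypothetical names.

```csharp
using System;
using PdfFileAnalyzer;      // PdfReader base class; assumed home of the DecryptionStatus enumeration
using PdfExtractFormData;   // PdfGetFormData and PdfFieldData

// Hypothetical helper, not part of the distribution.
public static class FormDataLoader
{
    // Returns the field array, or null if the file could not be decrypted
    // or has no interactive data fields. Other errors throw an exception.
    public static PdfFieldData[] Load(string pdfFileName, string password)
    {
        PdfGetFormData getFormData = new PdfGetFormData();

        // First attempt: open without a password.
        if (!getFormData.OpenPdfFile(pdfFileName))
        {
            // Assumption: the enumeration type is named DecryptionStatus.
            if (getFormData.DecryptionStatus != DecryptionStatus.InvalidPassword)
            {
                return null;
            }

            // Second attempt: try the password supplied by the caller.
            if (password == null || !getFormData.TestPassword(password))
            {
                return null;
            }
        }

        // A null result means the document has no interactive data fields.
        return getFormData.GetFields();
    }
}
```

Returning null for both "cannot decrypt" and "no fields" keeps the sketch short. An application would probably want to tell these two cases apart.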
The PDF document form data is stored in an array of PdfFieldData elements. Within the PDF document, these fields are organized in a hierarchical structure. The Index field (zero based) is the element's index into the PdfFieldData array. The Parent field is the index of the parent of the current field. A Parent value of -1 indicates a root field. You can navigate from any field back to the root.
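Walking the Parent links back to the root can be sketched as follows. The example joins the names along the path with periods, which is how the PDF specification forms a fully qualified field name from partial names. The PdfFieldData class is a local stand-in mirroring the class in this article, and FieldPath.FullName is a hypothetical helper. It assumes that every Parent value is either -1 or a valid index into the array.

```csharp
using System;
using System.Collections.Generic;

// Local stand-in mirroring the PdfFieldData class listed in this article.
public class PdfFieldData
{
    public int Index;
    public int Parent;
    public string Type;
    public string Name;
    public string AltName;
    public string Value;
}

// Hypothetical helper, not part of the distribution.
public static class FieldPath
{
    // Build a period-separated name by following Parent links to the root.
    public static string FullName(PdfFieldData[] fields, int index)
    {
        List<string> parts = new List<string>();

        // A Parent of -1 marks a root field and ends the walk.
        for (int i = index; i >= 0; i = fields[i].Parent)
        {
            parts.Add(fields[i].Name);
        }

        // Names were collected leaf first; reverse to get root first.
        parts.Reverse();
        return string.Join(".", parts.ToArray());
    }
}
```

For a three-level hierarchy with names form, address and city, FullName called on the city element returns form.address.city.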
The next four fields are defined in the PDF specification, page 675, Table 8.69. If the field Type (Key=FT) is blank, it is not a data field; it is part of the tree hierarchy. There are four types of data fields: Button (Btn), Text (Tx), Choice (Ch) and Signature (Sig). Each field has a name (Key=T), an alternate name (Key=TU) and a value (Key=V). The name and the alternate name are assigned by the PDF document creator. The value is entered by the user of the document. If the value is an empty string, the user did not enter a value. The values for buttons and choices are taken from a built-in list assigned by the document creator. The user selects the value from the list. If a choice field has multi-choice capability, the selected choices are separated by end of line. Signature fields are handled differently from the other field types, as described after the class definition below.
public class PdfFieldData
{
    public int Index;
    public int Parent;
    public string Type;
    public string Name;
    public string AltName;
    public string Value;
}
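The field types and the multi-choice value format described above can be interpreted with a small helper. This is a sketch under two assumptions: that the Type string holds the key value from the PDF (Btn, Tx, Ch or Sig, possibly with a leading slash), and that "end of line" may be either a carriage return and line feed pair or a single line feed. FieldValueHelper and its methods are hypothetical names.

```csharp
using System;

// Hypothetical helper, not part of the distribution.
public static class FieldValueHelper
{
    // Map the FT key value to a readable type name.
    public static string DescribeType(string type)
    {
        // Tolerate null, blank, and a leading '/' in case the name is stored PDF style.
        string key = (type ?? string.Empty).Trim().TrimStart('/');
        switch (key)
        {
            case "Btn": return "Button";
            case "Tx": return "Text";
            case "Ch": return "Choice";
            case "Sig": return "Signature";
            default: return "Not a data field";
        }
    }

    // Split a multi-choice value into the individual selections.
    public static string[] SelectedChoices(string value)
    {
        if (string.IsNullOrEmpty(value))
        {
            // An empty value means the user did not select anything.
            return new string[0];
        }
        return value.Split(new string[] { "\r\n", "\n", "\r" },
            StringSplitOptions.RemoveEmptyEntries);
    }
}
```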
If the field Type is signature, the Value field is “Signature n”, where n is the index of the signature in a result array. If there is one signature, the value will be Signature 0 and the array has one element. The data associated with all the signature fields is stored in an array of PDF dictionaries.
PdfDictionary[] Signatures = GetFormData.SignatureArray;
if(Signatures != null)
{
    PdfDictionary Signature = Signatures[0];
}
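The “Signature n” value can be turned back into an array index before it is used with the signature array. The sketch below assumes the value is exactly the word Signature, one space and a decimal number, as described above. SignatureValue.IndexOf is a hypothetical helper that returns -1 for any other value.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not part of the distribution.
public static class SignatureValue
{
    // Extract n from "Signature n"; return -1 if the value has another form.
    public static int IndexOf(string value)
    {
        const string prefix = "Signature ";
        if (value == null || !value.StartsWith(prefix, StringComparison.Ordinal))
        {
            return -1;
        }

        // NumberStyles.None accepts decimal digits only (no sign, no spaces).
        int index;
        bool ok = int.TryParse(value.Substring(prefix.Length), NumberStyles.None,
            CultureInfo.InvariantCulture, out index);
        return ok ? index : -1;
    }
}
```

A caller would still check that the returned index is within the bounds of the signature array before using it.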
Digital signatures are described in the PDF specification in Section 8.7, page 725. The signature dictionary is detailed in Table 8.102 on page 727. The PdfFileAnalyzer library allows you to extract any entry in this dictionary.
The “Save First Signature” button allows you to save a signature dictionary in text format. It is saved to a file with the same name as the PDF file but with the extension “.sig”. The conversion of the signature dictionary to text is done by:
string Text = GetFormData.SigDictionary(index);
PdfGetFormData provides one method as an example of how to extract data from the signature dictionary. The example shows how to extract the encrypted byte array of the signature.
byte[] Contents = GetFormData.SigContents(index);
History
- 2019/06/17: Version 1.0, Original revision