A PDF Forms Parser

22 Jun 2006
A parser for PDF Forms written in C#.NET.

Introduction

Although PDF documents are most often used for static content, they can also be used to represent user-fillable forms, much like HTML forms. PDF forms can be created by taking an existing PDF document and placing form fields on it using e.g. Adobe® Acrobat®. In many scenarios the resulting PDF forms are filled out by human users using a PDF viewing tool such as Adobe Acrobat. The actual data can be separated from the PDF that contains the representation using FDF or XFDF files, the latter being an XML format that contains the content of the form fields of a particular document. By using FDF or XFDF it is easy to programmatically fill out PDF forms in scenarios where the content is generated or queried from a database.

However, in certain scenarios it is required to incorporate the actual content into the PDF itself in order to have just one file that contains both content and representation. The small parser presented in this article helps to do just that, i.e. parse an existing PDF document containing form fields, get and set form field contents programmatically, and write the resulting PDF document back out.

Background

PDF is a proprietary format devised by Adobe Systems Incorporated in 1993. It is derived from PostScript, a stack-based page description language in the tradition of Forth. The specification for PDF is publicly available from the Adobe web site.

When I first started out trying to fill a PDF form programmatically, I had no idea what the PDF format looked like. So I just opened a PDF file with a text editor and discovered that the contents were actually human readable (or so it seemed). It was easy to identify the form fields and replace their content. Here's an excerpt from a PDF file that shows how a text field is represented:

2774 0 obj
<< 
/Type /Annot 
/Subtype /Widget 
/Rect [ 27.09381 776.96008 194.09021 789.76807 ] 
/F 4 
/P 1996 0 R 
/AP << /N 14 6 R >> 
/DA (/Helv 10 Tf 0 g)
/T (Name)
/FT /Tx 
/Ff 4194304 
/DV (Smith)
/V (Smith)
>> 
endobj

Here, /T (Name) represents, not surprisingly, the name of the field you assign to it in the properties dialog of Acrobat. It's also easy to figure out that the "Smith" strings in parentheses represent the content of the field. /V stands for the actual value, while /DV represents the default value that the field content reverts to when the field is reset.
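To illustrate how these entries can be picked out of such an excerpt, here is a minimal sketch. It is not the parser presented in this article; the class `FieldEntrySketch` and its method are hypothetical, and the regular expression only handles simple literal strings without escapes or nested parentheses.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class FieldEntrySketch
{
    // Matches /T, /V or /DV followed by a simple literal string, e.g. "/T (Name)".
    // Keys such as /Type or /Tx do not match because no "(" follows the key.
    private static readonly Regex Entry =
        new Regex(@"/(DV|T|V)\s*\(([^()\\]*)\)");

    // Returns the string entries found in the text of one field object,
    // keyed by entry name without the leading slash.
    public static Dictionary<string, string> ReadStringEntries(string objectText)
    {
        var entries = new Dictionary<string, string>();
        foreach (Match m in Entry.Matches(objectText))
        {
            entries[m.Groups[1].Value] = m.Groups[2].Value;
        }
        return entries;
    }
}
```

Applied to the excerpt above, this yields the entries T, DV and V. A real parser has to tokenize the dictionary properly, which is what the recursive descent approach described later does.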

If you replace the string "Smith" by "Jones" you will find that the displayed field content has not actually changed; it changes only after you click on the field in Acrobat. This is because Acrobat does not use the value of the form field for the visual representation, but "caches" the visual representation in an appearance stream object referenced from the /AP entry. Only after you click on the field will Acrobat regenerate the appearance stream and thus the visual representation. To work around this problem, you can try to find the appearance stream and change the string there as well.

But there are more problems. If you replace "Smith" by "Washington" Acrobat will report an error. This is because PDF is not in fact a text format but a binary format that contains an offset table with the byte offsets of the start of all objects.

If you change the offset of an object by extending an object earlier in the file but do not fix the offset table, the file gets corrupted. Usually Acrobat can fix minor errors in the offset table so you will usually still see something in Acrobat, but clearly this is not the right approach to filling form fields.
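To make the role of the offset table concrete, here is a minimal sketch that reads the `startxref` value at the end of a file and checks whether the `xref` keyword is really located at that byte offset. The class and method names are hypothetical and not part of the parser in this article, and the sketch assumes a classic cross reference table rather than the cross reference streams of newer PDF versions.

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Text;

// Hypothetical helper, for illustration only.
public static class XrefSketch
{
    // Reads the number following the last "startxref" keyword near the end of the file.
    public static long ReadStartXref(byte[] pdf)
    {
        int tailLength = Math.Min(pdf.Length, 1024);
        // ASCII decoding maps every byte to exactly one char, so positions stay aligned.
        string tail = Encoding.ASCII.GetString(pdf, pdf.Length - tailLength, tailLength);
        int keyword = tail.LastIndexOf("startxref", StringComparison.Ordinal);
        if (keyword < 0) throw new InvalidDataException("startxref not found");

        int pos = keyword + "startxref".Length;
        while (pos < tail.Length && char.IsWhiteSpace(tail[pos])) pos++;
        int start = pos;
        while (pos < tail.Length && tail[pos] >= '0' && tail[pos] <= '9') pos++;
        if (pos == start) throw new InvalidDataException("no offset after startxref");

        return long.Parse(tail.Substring(start, pos - start), CultureInfo.InvariantCulture);
    }

    // Returns true if the bytes at the given offset spell "xref".
    public static bool PointsAtXref(byte[] pdf, long offset)
    {
        byte[] marker = Encoding.ASCII.GetBytes("xref");
        if (offset < 0 || offset + marker.Length > pdf.Length) return false;
        for (int i = 0; i < marker.Length; i++)
        {
            if (pdf[offset + i] != marker[i]) return false;
        }
        return true;
    }
}
```

If bytes are inserted anywhere before the table without updating the recorded offsets, `PointsAtXref` returns false for the stored value, which is the kind of corruption described above.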

A workaround to this problem would be to always replace the exact same number of characters by truncating strings that are too long and padding with whitespace those that are too short. If you have control over the design of the PDF form you might choose as the initial content of each text field a fixed number of whitespace characters that definitely extend over the right edge of the field's box.
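The truncate-or-pad workaround could look roughly like this. The helper name `FitToWidth` is made up for this illustration, and the sketch assumes plain one-byte characters that need no escaping inside a PDF literal string (no parentheses or backslashes).

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class FixedWidthSketch
{
    // Returns the value cut or padded with spaces to exactly 'width' characters,
    // so that replacing the old string does not shift any byte offsets.
    public static string FitToWidth(string value, int width)
    {
        if (value == null) throw new ArgumentNullException("value");
        if (width < 0) throw new ArgumentOutOfRangeException("width");

        if (value.Length > width)
        {
            return value.Substring(0, width); // too long: truncate
        }
        return value.PadRight(width, ' ');    // too short: pad with spaces
    }
}
```

For a field that currently holds the five characters of "Smith", `FitToWidth("Washington", 5)` gives "Washi" and `FitToWidth("Doe", 5)` gives "Doe" followed by two spaces, so the file length stays the same in both cases.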

While these workarounds may be appropriate in certain situations, I found them not to be satisfying and wrote my own little PDF parser.

The PDF Parser

The parser is not a full-fledged PDF parser but rather a small, one-class parser that can be dropped into any project that needs form field parsing, without pulling in a whole library and its overhead. Although the parser supports all types of PDF objects except for streams, it parses just the form fields of a PDF file by looking at the AcroForm dictionary. If you need a full-fledged PDF parser you might want to look at the iText library, which has been ported to several platforms including .NET.

The parser is designed as a straightforward recursive descent parser. Since we are interested only in the form fields, the parser first parses the cross reference tables that contain the offsets of all objects and then finds the AcroForm dictionary that contains the identifiers of all form fields. Once we know the start and end offsets of all form fields, we can parse each form field object (a special form of dictionary object) in a recursive descent fashion. In summary, these are the steps to parse the whole PDF:

  1. Parse cross reference table(s) identifying byte offsets for all objects.
  2. Parse AcroForm dictionary object identifying form field object identifiers.
  3. Parse all form field objects in recursive descent fashion.

This leaves us with a list of (C#) objects whose contents can be programmatically queried and updated.

In order to write a conformant PDF file, we make use of a feature of the PDF format that provides for easy extensibility of PDF documents. PDF objects provide a simple versioning mechanism that makes it possible to append newer versions of objects already contained in a PDF file to the file. We simply write out all field objects that have changed and add an updated cross reference table that links to the old cross reference table.

This same mechanism is also used by Acrobat itself when you change a form field and press the "Save" button. That's why PDF files keep getting bigger although you don't actually add any new content. Only when you do a "Save as" does Acrobat reorganize the PDF and eliminate duplicate object entries.
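The append-only update can be sketched as follows. This is not the code of the `WritePdf` method used in this article; the class `IncrementalUpdateSketch` and all parameter names are assumptions made for illustration. The sketch writes a single changed object, a one-entry cross reference section, and a trailer whose /Prev entry points at the previous cross reference table. A real writer would carry over the remaining entries of the old trailer as well.

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Text;

// Hypothetical helper, for illustration only.
public static class IncrementalUpdateSketch
{
    // objectBody is the text between "obj" and "endobj", e.g. a field dictionary.
    // previousXrefOffset is the old startxref value; size is the /Size of the old trailer.
    public static byte[] AppendObject(byte[] original, int objectNumber, int generation,
        string objectBody, long previousXrefOffset, int size, string rootRef)
    {
        using (var output = new MemoryStream())
        {
            output.Write(original, 0, original.Length); // old content stays untouched
            Write(output, "\n");

            long objectOffset = output.Position;        // byte offset of the new version
            Write(output, "{0} {1} obj\n{2}\nendobj\n", objectNumber, generation, objectBody);

            long xrefOffset = output.Position;
            Write(output, "xref\n{0} 1\n", objectNumber);
            // Each entry is exactly 20 bytes: 10-digit offset, 5-digit generation, "n", EOL.
            Write(output, "{0:D10} {1:D5} n \n", objectOffset, generation);
            Write(output, "trailer\n<< /Size {0} /Root {1} /Prev {2} >>\n",
                size, rootRef, previousXrefOffset);
            Write(output, "startxref\n{0}\n%%EOF\n", xrefOffset);

            return output.ToArray();
        }
    }

    private static void Write(Stream stream, string format, params object[] args)
    {
        string text = string.Format(CultureInfo.InvariantCulture, format, args);
        byte[] bytes = Encoding.ASCII.GetBytes(text);
        stream.Write(bytes, 0, bytes.Length);
    }
}
```

Because nothing in the original bytes moves, all offsets in the old cross reference table remain valid, and a reader that follows /Prev finds every object that was not replaced.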

Using the code

The following example reads a PDF file, parses it, changes the value of a form field and writes an updated PDF file back out.

// read the file and parse it
PdfReader reader = new PdfReader(filename);

// change one text field
try
{
    ((PdfTXField)reader.FieldsByName["Name"]).Text = "Doe";
}
catch
{
    // ignored here for brevity, e.g. when the form has no text field called "Name";
    // log or rethrow if a missing field should count as an error
}

// write the updated file back out; "using" closes the stream even if WritePdf throws
using (FileStream fileStream = new FileStream(newFilename, System.IO.FileMode.Create))
{
    reader.WritePdf(fileStream);
}

Most properties of fields are accessible through properties in .NET as well, e.g.:

// a radio button
PdfRadioButtonField radio = ...;
// set the selected button; "Off" means no button is selected
radio.SelectedItem = "MasterCard";
// one button must be pressed
radio.NoToggleToOff = true;

// a check box
PdfCheckBoxField checkBox = ...;
// check it
checkBox.Checked = true;

// a text field
PdfTXField text = ...;
// set the text
text.Text = "Hello, World.";
// mark it as a password field
text.Password = true;

// a combo or list box
PdfCHField choice = ...;
// render as combo box
choice.Combo = true;
// more than one item is selectable
choice.MultiSelect = true;
// select items 1 and 3
choice.SetSelectedIndexes(1, 3);
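Boolean properties such as Password correspond to single bits in the /Ff (field flags) entry shown in the excerpt earlier. How the classes in this article implement them is not shown here, so the following is only a sketch under that assumption. The enum and helper names are hypothetical; the bit positions are the ones the PDF Reference assigns to text fields.

```csharp
using System;

// Hypothetical flag names; bit positions as defined for text fields in the PDF Reference.
[Flags]
public enum TextFieldFlags
{
    None = 0,
    Multiline = 1 << 12,       // bit position 13
    Password = 1 << 13,        // bit position 14
    DoNotSpellCheck = 1 << 22  // bit position 23
}

// Hypothetical helper, for illustration only.
public static class FieldFlagsSketch
{
    // Returns true if the given flag is set in the /Ff value.
    public static bool HasFlag(int ff, TextFieldFlags flag)
    {
        return (ff & (int)flag) != 0;
    }

    // Returns the /Ff value with the given flag switched on or off.
    public static int SetFlag(int ff, TextFieldFlags flag, bool on)
    {
        return on ? ff | (int)flag : ff & ~(int)flag;
    }
}
```

The value /Ff 4194304 from the excerpt is 1 << 22, so only the DoNotSpellCheck bit is set there. Switching on Password as well would give 4194304 + 8192 = 4202496.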

Points of Interest

  • The parser can deal with almost all string representations the PDF Reference document provides for, i.e. literal strings including escape sequences and hexadecimal strings with possibly missing digits. It can also parse Unicode (UTF-16) encoded text strings. Language detection is not supported, however. Strings are always written out in literal format.
  • The parser supports all form field types except for signature fields. The supported types are Button (including Pushbutton, Checkbox, and Radio Button), Text, and Choice.
  • As of version 1.2 (see History) the parser can deal with linearized PDF files, i.e. files that were saved with the option "optimized for fast web view" in Acrobat. Encrypted files cannot be parsed.
  • For demo forms you might want to download the Adobe Acrobat Forms Samples package which includes a number of forms that exhibit most of the features of PDF forms.
  • Adobe, Acrobat, and Acrobat Reader are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States and/or other countries.
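The handling of hexadecimal strings with a missing digit and of UTF-16 text strings mentioned in the first point can be sketched as follows. This is an independent illustration, not the parser's own code. The names are hypothetical, and strings without a byte order mark are decoded as ASCII here as a simplification of PDFDocEncoding.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class PdfStringSketch
{
    // Decodes a hexadecimal string token such as "<901FA>" into bytes.
    // Whitespace is ignored; a missing final digit is treated as 0.
    public static byte[] DecodeHexString(string token)
    {
        if (token == null || token.Length < 2 ||
            token[0] != '<' || token[token.Length - 1] != '>')
        {
            throw new FormatException("not a hexadecimal string token");
        }

        var digits = new StringBuilder();
        for (int i = 1; i < token.Length - 1; i++)
        {
            char c = token[i];
            if (char.IsWhiteSpace(c)) continue;
            if (!Uri.IsHexDigit(c)) throw new FormatException("invalid hex digit: " + c);
            digits.Append(c);
        }
        if (digits.Length % 2 == 1) digits.Append('0'); // odd count: assume trailing 0

        var bytes = new byte[digits.Length / 2];
        for (int i = 0; i < bytes.Length; i++)
        {
            bytes[i] = Convert.ToByte(digits.ToString(2 * i, 2), 16);
        }
        return bytes;
    }

    // Decodes a text string: UTF-16BE if it starts with the byte order mark FE FF,
    // otherwise ASCII (a simplification of PDFDocEncoding).
    public static string DecodeTextString(byte[] bytes)
    {
        if (bytes.Length >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
        {
            return Encoding.BigEndianUnicode.GetString(bytes, 2, bytes.Length - 2);
        }
        return Encoding.ASCII.GetString(bytes);
    }
}
```

For example, `DecodeHexString("<901FA>")` yields the three bytes 90, 1F and A0, and a token that starts with FEFF is decoded as UTF-16 text.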

Tools used

I have written a number of unit tests using the NUnit unit testing framework which are included with the sources.

Class library documentation can be generated from the sources using the NDoc code documentation generator. The documentation can then be used from within Visual Studio .NET just like the .NET Framework class library documentation. An appropriate configuration file for NDoc is included with the sources.

Both NUnit and NDoc are open source software.

History

  • August 19, 2004: Version 1.0.
  • August 26, 2004: Version 1.1.
    • Added paragraph about appearance streams.
  • September 25, 2004: Version 1.2.
    • Now supports linearized files.
    • Now supports inherited fields.
    • Uses NAnt.
    • Uses log4net.
  • October 01, 2004: Version 1.3.
    • Fixed a bug parsing objects (thanks to Eddie Neal for helping me find it).
    • Fixed a number of FxCop issues, particularly regarding naming (thanks to Heath Stewart for making me aware).

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.