Introduction
PdfView is a utility that displays the structural elements of a PDF document. Since its inception in 1993, PDF has gained popularity as the format for exchange of electronic documents and forms. It is possible to create a well-formed PDF using a text editor. The simplicity of the format enables developers to create PDF documents using in-house solutions, without resorting to any external toolkits. The problem is that, after a while, it becomes difficult to navigate a document you have created, due to the format's hierarchical structure and the common use of indirect references within its objects. What's more, most PDF documents are a mixture of text and binary data. The PdfView utility tries to address that problem by making it possible to traverse the PDF document tree visually.
Background
Portable Document Format (PDF) is a file format developed by Adobe Systems for representing documents in a manner that is independent of the original application software, hardware, and operating system used to create those documents. A PDF file can describe documents containing any combination of text, graphics, and images in a device independent and resolution independent format. These documents can be one page or thousands of pages, very simple or extremely complex with a rich use of fonts, graphics, colour, and images. PDF is an open standard, and anyone can write applications that can read or write PDFs royalty-free.
Main PDF concepts
PDF supports seven basic types of objects: booleans, numbers, strings, names, arrays, dictionaries, and streams. Booleans, numbers, and strings are simple values. As they are not nested, PdfView simply displays them as values. An array is a sequence of PDF objects. An array may contain a mixture of object types. A dictionary is an associative table containing pairs of objects. The first element of each pair is called the key and the second element is called the value. The key must be a name. A value can be any kind of object, including a dictionary. A stream consists of a dictionary that describes a sequence of characters, followed by the keyword stream, followed by zero or more lines of characters, followed by the keyword endstream. Since streams are basically binary blobs, PdfView just ignores and skips stream blocks. An indirect reference is a reference to an indirect object, and consists of the indirect object's object number, generation number, and the R keyword. The cross reference table contains information that permits random access to indirect objects in the file, so that the entire file need not be read to locate any particular object.
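To make the shape of an indirect reference concrete, here is a minimal sketch in standard C++ that recognizes text such as "12 0 R". It is not taken from PdfView: the IndirectRef struct and the parseIndirectRef function are hypothetical names, and the whitespace handling is simplified compared with what the PDF format allows.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical value type: the two numbers that identify an indirect object.
struct IndirectRef {
    unsigned long objectNumber;
    unsigned long generation;
};

// Skips whitespace starting at pos; returns true if at least one character was skipped.
static bool skipWhitespace(const std::string& s, std::size_t& pos) {
    std::size_t start = pos;
    while (pos < s.size() && std::isspace(static_cast<unsigned char>(s[pos]))) ++pos;
    return pos != start;
}

// Reads a run of decimal digits starting at pos; returns false if there is none.
static bool readUnsigned(const std::string& s, std::size_t& pos, unsigned long& value) {
    std::size_t start = pos;
    value = 0;
    while (pos < s.size() && std::isdigit(static_cast<unsigned char>(s[pos]))) {
        value = value * 10 + static_cast<unsigned long>(s[pos] - '0');
        ++pos;
    }
    return pos != start;
}

// Parses "<object number> <generation number> R" and nothing else.
// Returns true and fills 'out' on success; 'out' is left untouched on failure.
bool parseIndirectRef(const std::string& text, IndirectRef& out) {
    std::size_t pos = 0;
    unsigned long objectNumber = 0;
    unsigned long generation = 0;

    skipWhitespace(text, pos);                                   // optional leading space
    if (!readUnsigned(text, pos, objectNumber)) return false;
    if (!skipWhitespace(text, pos)) return false;                // separator is required
    if (!readUnsigned(text, pos, generation)) return false;
    if (!skipWhitespace(text, pos)) return false;                // separator is required
    if (pos >= text.size() || text[pos] != 'R') return false;    // the R keyword
    ++pos;
    skipWhitespace(text, pos);                                   // optional trailing space
    if (pos != text.size()) return false;                        // reject trailing junk

    out.objectNumber = objectNumber;
    out.generation = generation;
    return true;
}
```

Calling parseIndirectRef("12 0 R", ref) fills ref with object number 12 and generation 0, while "12 0 obj" is rejected because it ends in a different keyword.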
The trailer enables an application reading a PDF file to quickly find the cross reference table and certain special objects. For this reason, applications should read a PDF file from its end. The trailer dictionary sits near the very end of the PDF document and is the root of the PDF object tree.
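Reading from the end can be sketched as follows. In the PDF format, the last lines of a file hold the startxref keyword followed by the byte offset of the cross reference table. The function below, findXrefOffset, is a hypothetical helper and not PdfView's code; it assumes the whole file is already in memory as a std::string and does no validation beyond finding the number.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Returns the byte offset written after the last "startxref" keyword,
// or -1 if the keyword or the number that should follow it is missing.
long findXrefOffset(const std::string& pdfBytes) {
    const std::string keyword = "startxref";

    // Search backwards, because the keyword lives near the end of the file.
    std::size_t pos = pdfBytes.rfind(keyword);
    if (pos == std::string::npos) return -1;
    pos += keyword.size();

    // Skip the line break (or other whitespace) after the keyword.
    while (pos < pdfBytes.size() &&
           std::isspace(static_cast<unsigned char>(pdfBytes[pos]))) {
        ++pos;
    }

    // The offset is a plain run of decimal digits.
    if (pos >= pdfBytes.size() ||
        !std::isdigit(static_cast<unsigned char>(pdfBytes[pos]))) {
        return -1;
    }
    long offset = 0;
    while (pos < pdfBytes.size() &&
           std::isdigit(static_cast<unsigned char>(pdfBytes[pos]))) {
        offset = offset * 10 + (pdfBytes[pos] - '0');
        ++pos;
    }
    return offset;
}
```

A real reader would then seek to that offset, parse the cross reference table, and read the trailer dictionary that follows it.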
Using the code
PdfView is a typical MFC Document/View application. It is a utility in itself, and the code within was not intended to be reused in other applications. However, let me summarize the main classes:
CBRawPdf: This class stores the currently displayed file as a byte array. CBPdf uses it to traverse within that byte array. The class has no knowledge of the higher level PDF structures like dictionaries, arrays and cross reference tables. It performs navigational tasks such as getting the next/previous token/line.
CBPdf: This class deals with the higher level structure of the PDF. It uses CBRawPdf to traverse within the document. It can render a PDF file in a tree or a rich text control.
CBPdfValue, CBPdfReference, CBPdfArray, CBPdfDictionary, CBPdfStream: Each one of these classes stores a type of PDF object, namely values, references, arrays, dictionaries, and streams. All are derived from the same base class, CBPdfObject.
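The object classes could be laid out roughly as in the sketch below. Only the class names come from the description above. Every member, including the Describe method and the use of standard containers instead of MFC ones, is an assumption made for illustration, and CBPdfStream is left out for brevity.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Common base class. Describe() is a hypothetical hook that returns
// the label a tree view might show for the node.
class CBPdfObject {
public:
    virtual ~CBPdfObject() {}
    virtual std::string Describe() const = 0;
};

// A simple, non-nested value such as a boolean, number, or string.
class CBPdfValue : public CBPdfObject {
public:
    explicit CBPdfValue(const std::string& text) : m_text(text) {}
    std::string Describe() const override { return m_text; }
private:
    std::string m_text;
};

// An indirect reference: object number, generation number, R keyword.
class CBPdfReference : public CBPdfObject {
public:
    CBPdfReference(unsigned objectNumber, unsigned generation)
        : m_objectNumber(objectNumber), m_generation(generation) {}
    std::string Describe() const override {
        return std::to_string(m_objectNumber) + " " +
               std::to_string(m_generation) + " R";
    }
private:
    unsigned m_objectNumber;
    unsigned m_generation;
};

// A sequence of objects of any type.
class CBPdfArray : public CBPdfObject {
public:
    void Add(std::unique_ptr<CBPdfObject> item) { m_items.push_back(std::move(item)); }
    std::size_t Size() const { return m_items.size(); }
    std::string Describe() const override {
        return "array [" + std::to_string(m_items.size()) + "]";
    }
private:
    std::vector<std::unique_ptr<CBPdfObject> > m_items;
};

// Name keys mapped to values of any type, including nested dictionaries.
class CBPdfDictionary : public CBPdfObject {
public:
    void Set(const std::string& key, std::unique_ptr<CBPdfObject> value) {
        m_entries[key] = std::move(value);
    }
    std::string Describe() const override {
        return "dictionary <<" + std::to_string(m_entries.size()) + ">>";
    }
private:
    std::map<std::string, std::unique_ptr<CBPdfObject> > m_entries;
};
```

A shared base class lets containers hold any mixture of object types, which mirrors the way PDF arrays and dictionaries can nest arbitrary objects.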
Graph visualization of PDF objects
Optionally, the utility enables you to create a relational graph of the objects within the PDF file. For this, it needs Graphviz.
Graphviz is open source graph visualization software. It has several main graph layout programs. These programs take descriptions of graphs in a simple text language and produce diagrams in several useful formats, such as images and SVG for web pages or PostScript for inclusion in PDF and other documents. They can also display the graph in an interactive graph browser.
After you open a PDF file using the utility, you can create a Graphviz compatible text file by selecting "File | Save As Dot File...". After that, the following command converts that text file to an image file:
dot.exe -Tgif pdf.dot -o pdf.gif
which gives you an image of the object graph of the PDF file.
It is important to note that large PDF files have thousands of objects. Naturally, Graphviz cannot cope with these files, as the output image file tends to be huge. To prevent this, I have hard-coded a maximum limit of 250 objects into the utility. Experienced users can remove that limit, simplify the generated text file by removing the objects that are not needed in the graph and then create the image file.
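The following sketch shows how a list of object references could be written out in the Graphviz text language with a cap on the number of objects. It is an illustration only: the RefEdge type, the toDot function, the objN node names, and the rule of skipping any edge that would exceed the cap are all assumptions, not a description of what PdfView actually writes.

```cpp
#include <cstddef>
#include <set>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical edge type: the object numbered 'first' refers to the object numbered 'second'.
typedef std::pair<unsigned, unsigned> RefEdge;

// Builds a DOT digraph from the edges, keeping at most maxObjects distinct objects.
// An edge is skipped when adding its endpoints would exceed that limit.
std::string toDot(const std::vector<RefEdge>& edges, std::size_t maxObjects) {
    std::set<unsigned> seen;  // object numbers already present in the graph
    std::ostringstream out;
    out << "digraph pdf {\n";

    for (std::size_t i = 0; i < edges.size(); ++i) {
        const unsigned from = edges[i].first;
        const unsigned to = edges[i].second;

        // Count how many new objects this edge would introduce.
        std::size_t added = 0;
        if (seen.count(from) == 0) ++added;
        if (to != from && seen.count(to) == 0) ++added;
        if (seen.size() + added > maxObjects) continue;

        seen.insert(from);
        seen.insert(to);
        out << "  obj" << from << " -> obj" << to << ";\n";
    }

    out << "}\n";
    return out.str();
}
```

Saving the returned string to pdf.dot gives a file that the dot command shown earlier can convert to an image. Passing 250 as maxObjects matches the limit mentioned above.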
A final note
Since there are dozens of PDF generators, there are probably some PDF documents that this utility cannot parse correctly. If you e-mail me a link to these documents, I can update the utility to support these documents as well.
History
- 07th October, 2005: Version 1.1 (Graph visualization)
- 25th September, 2005: Version 1.0